DOCUMENT RESUME



ED 078 877 



LI 004 412 



AUTHOR          Fiorello, Marco R.
TITLE           Management and Design Tools for Document Retrieval
                Systems: A Method for Predicting Quantity Output.
INSTITUTION     Rand Corp., Santa Monica, Calif.
PUB DATE        Mar 73
NOTE            221p.; (153 References)
AVAILABLE FROM  The Rand Corporation, 1700 Main St., Santa Monica,
                Calif. 90406 ($5.00)
EDRS PRICE      MF-$0.65 HC-$9.87
DESCRIPTORS     *Design; Information Processing; *Information
                Retrieval; *Information Storage; *Information
                Systems; *Management; Search Strategies

ABSTRACT 

The existing volume and increasing growth rate of
documented information has resulted in numerous efforts to construct
operational Document Storage and Retrieval Systems, as a practical
solution to the demand for information storage and retrieval.
Accompanying the surge to build more and better and bigger Document
Retrieval Systems (DRSs) was the realization that there are few
effective tools for the designers and managers of these systems. The
tasks of design and management of DRSs require tools and performance
measures to aid in the selection of preferred options, and in the
control over the fundamental processes of inquiry analysis, indexing,
retrieval and system growth. A step toward the generation of
operational tools to aid in the design and management tasks is
presented in this report, by the development of a Retrieval Quantity
(Rq) estimate. The Rq estimate is defined as a function of the
inquiry form, search strategy and descriptor-document distribution,
and can be used to predict the quantity output due to system growth,
and aid in the tuning of the indexing and formal inquiry
specification processes. (Author/NH)
ABSTRACT

The existing volume and increasing growth rate of documented
information has resulted in numerous efforts to construct operational
Document Storage and Retrieval Systems, as a practical solution to the
demand for information storage and retrieval. Accompanying the surge
to build more and better and bigger Document Retrieval Systems (DRSs)
was the realization that there are few effective tools for the
designers and managers of these systems. The tasks of design and
management of DRSs require tools and performance measures to aid in the
selection of preferred options, and in the control over the fundamental
processes of inquiry analysis, indexing, retrieval and system growth.
A step toward the generation of operational tools to aid in the
design and management tasks is presented in this report, by the
development of a Retrieval Quantity (Rq) estimate. The Rq estimate is
defined as a function of the inquiry form, search strategy and
descriptor-document distribution, and can be used to predict the quantity
output of an inquiry, measure the impact on quantity output due to
system growth, and aid in the tuning of the indexing and formal
inquiry specification processes. The definition of the Rq measure is
based on the identification of certain canonical forms which
characterize the underlying principles of DRS indexing and retrieval. The
Rq estimate was tested on an operational DRS, and demonstrated high
prediction accuracy for a variety of typical inquiries. Though
developed on a small DRS, the methodology for determining Rq appears to
hold for a very wide range of system size, subject content and
construction.
It is in the nature of the mind to forget and in the nature
of man to worry over his forgetfulness....

Bower 

Chapter 1 

INFORMATION STORAGE AND RETRIEVAL: BACKGROUND ISSUES
1.1 INTRODUCTION 

Man has always employed some means of storing and retrieving
information. In early tribal or closed-society environments, man's memory
was the principal repository of knowledge, the link between successive
generations and between the discovery of new knowledge and those who would
use it. The advent of formal speech and recordable languages provided the
means for accumulation of experience and knowledge in mediums for
transmission, storage and use by others, in a relatively time independent
sense (93). As the scope and content of information became more voluminous
and complex, formal systems were constructed for information storage and
retrieval.

This report is concerned with certain underlying principles that
characterize a large class of formal information storage and retrieval
systems. Throughout the discussion that follows, at the risk of
terminological monotony, the term information will be used continuously to
describe what "it" is that information storage and retrieval systems store
and retrieve. No definition of information is given, principally because
there is no generally accepted precise definition available. Descriptively,
information has been labeled: the essential ingredient of conversation,
writing and thought; recorded experience essential for decisionmaking; the
essential link between means and ends; a resource; meaningful data; the
result of a process on data; and a symbol or signal that a system can
employ to guide or control its functions (6, 26, 27, 149). Information,
however, is not considered to be knowledge, per se, or communication. On
the other hand, knowledge is thought of as an organized body of
information, and communication is viewed as information transfer. The
notions become even more confounded when one considers the additional
(though fuzzy) distinctions between data and information, data and
knowledge, and so on.

Suffice it to say that the entities (data, information, and knowledge)
are different, relative in place and time, and that the basis of
distinction is in part rigorously quantitative (viz., Information Theory
(145)) and in part qualitative (i.e., contemporary, intuitive concepts and
usage). For this analysis, information is intuitively treated as existing
in graphic records* (e.g., documents) and as perceivable by an inquiring
mind which has a need for information.

Contemporary society can be viewed as an enormous information
generating, processing, storage and retrieval mechanism. The problem of
overabundance of information is compounded by a seemingly cultural
magpie-like behavior which seeks to store the better part of all
information and to retrieve it as well (29). There is no accurate census of
the literature population, but a number of statistical estimations have
been made. De Solla Price (41) has estimated that 350,000 scientific papers
are published annually. Bourne (12, 13) has estimated that there are 30,000
to 35,000 journals published annually, of which 15,000 are significant,**
and that the volume of significant papers published throughout the world
per year is between 900,000 and 2,100,000. Further, there are an estimated
3500 abstracting and indexing services in the world (circa 1960). In
addition to the periodical population there are the monograph and
abstract files. Figure 1.1 illustrates the estimation of the file size
of books and periodicals in U.S. colleges and universities, and Fig.
1.2 the file size for U.S. public libraries. An estimate of the number
of technical literature abstracts and/or citations produced annually
throughout the world is given in Fig. 1.3.

*The specification of graphic records is for the purposes of this
analysis, and is not meant to imply that written/printed language is the
only source of information. Other media, often less restrictive, are the
non-graphic verbal and non-verbal.

**No definition of significance is given.

While the per annum volume of periodicals, abstracts and monographs
is impressive, the estimated growth rates are staggering. De Solla Price
(42, 43) has plotted (see Fig. 1.4) the growth of scientific and abstract
journals published from the oldest surviving periodical* to the year
2000, and an exponential growth is clearly evident. Holt (67) surveyed
the growth of the professional literature in economics, electrical
engineering, physics, psychology and biology, and also observed
exponential growth characteristics; his results are shown in Fig. 1.5. Holt
(67), Brookes (19), and Krauze (79) all suggest that the growth of
literature in terms of the number of articles and journals is of the form:

    V_t = V_{t_0} e^{r (t - t_0)} - D^{*} + \epsilon_t

where  V_t        = total volume of literature (in the field of interest)
                    at time t
       V_{t_0}    = volume of literature at time t_0
       r          = the growth rate (estimated to result in a doubling every
                    10 years). Note: e^{10r} ≈ 2 for r = 7 percent per annum
       \epsilon_t = the statistical error of measurement; assumed to have
                    the property Expected Value (\epsilon_t) = 0
       D^{*}      = the death rate of journals, and the death rate or
                    near-usefulness of articles

*Philosophical Transactions of the Royal Society of London (1665).

The above analysis, admittedly cursory and non-rigorous, does imply an
information control, storage and retrieval problem. All evidence seems
to say that for any established field there is an abundance of
information, and it is growing.
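
The doubling-time arithmetic behind the quoted growth rate can be checked
with a short sketch. It assumes pure exponential growth and deliberately
ignores the error and death-rate terms; the function names are hypothetical,
and the base of 350,000 papers is borrowed from the De Solla Price estimate
cited earlier purely as an illustrative starting volume.

```python
import math


def growth_rate_from_doubling(doubling_years):
    """Continuous growth rate r such that exp(r * doubling_years) == 2."""
    return math.log(2) / doubling_years


def projected_volume(v0, r, years):
    """Volume after `years` of pure exponential growth.

    Assumption: no error term and no death rate, unlike the fuller model.
    """
    return v0 * math.exp(r * years)


r = growth_rate_from_doubling(10)
print(f"r = {r:.4f} per annum")  # close to 0.07, i.e. about 7 percent

# Illustrative base volume only; after one doubling period it should double.
print(f"{projected_volume(350_000, r, 10):,.0f} papers after 10 years")
```

A rate of roughly 7 percent per annum is therefore just another way of
stating a ten-year doubling time.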

There is a bona fide need to store a substantial portion of existing
literature, and there is a need for a physically feasible means of
retrieving information that is both economically practical, and time and
content relevant to the information user. Concern about this information
handling problem has placed new emphasis on the traditional activities of
assembling and coding recorded information, and has resulted in the
emergence of a new discipline, Information Science, which focuses on the
analysis and solution of information storage and retrieval (ISR)
problems. A variety of systems, processes and techniques have been
constructed to cope with many ISR problems, and a typical set of ISR
processes and their interactions are illustrated in Fig. 1.6.

An important subset of ISR systems are document retrieval systems
(DRSs), which, as the title implies, retrieve documents and hence the
information in them indirectly. This subset of systems, for instance,
excludes fact retrieval or intelligence retrieval systems. The term
document is used as a generic for information-bearing items:
monographs/books, periodical articles, abstracts, film, machine
coded/readable tape, etc. It is with this particular class of ISR systems
that this report is concerned.

A great deal of conjecture surrounds the assessment of D*. It
is believed (Brookes (19)) that it has exponential properties, but
these are very relative to the user and subject in question.

To date, extensive research has been carried out on various aspects
of DRSs. Ostensibly, major efforts* have been made in index analyses
and evaluation by Cleverdon (30, 31, 32), Taube (135, 136), Gull (61),
Thorne (138) and Swanson (128); user satisfaction by Borko (11),
Bourne (14, 15), Fairthorne (47), Goffman (56), Rees (112, 113) and
Swets (131, 132); retrieval output relevancy by Barhydt (5), Cuadra
(36, 37, 38), Doyle (45), Goffman (57), Lancaster (82), and Salton
(117, 118, 119); and automatic classification by Litofsky (90).
Notwithstanding these and other efforts, more problems remain unsolved
than solved in the design, management and evaluation of DRSs. Of
particular interest is the class of problems concerning the estimation
of the retrieval quantity of DRSs. This particular dimension of DRS
performance has not been thoroughly analyzed, and no satisfactory
operational solution has been suggested.

The basic objective of this research is to develop a methodology
that will enable designers and managers of DRSs to estimate the
quantity output in response to an inquiry, prior to the processing of the
inquiry. A secondary objective is to demonstrate how the estimation
methodology can be used to assess DRS changes over time.

*No attempt is made to be exhaustive; the cited work is intended
to be a representative sample of previous efforts by some of the more
well-known researchers in Information Science.
Before proceeding with the derivation of the retrieval quantity (Rq)
measure, the context and qualifications of the analysis will be
presented. In the next chapter the specific class of DRSs to which the
Rq estimation procedure applies is described, and Chapter 3 presents
a survey of the many DRS measures of performance to place the Rq
measure in perspective.

Chapter 4 presents a discussion of previous efforts to develop
an output quantity measure, and also contains a formal description of
the recommended methodology to develop the estimate. Chapter 5 presents
a description of the experiments performed to evaluate the Rq
estimation procedure, together with the results of the experiments.
Chapter 6 presents a discussion of various applications of the Rq
measure to aid in the management and design of DRSs. Also, Appendix A
contains a glossary of Information Storage and Retrieval terminology.
A man should keep his little brain attic stocked with
all the furniture that he is likely to use, and the rest he
can put away in the lumber-room of his library, where he can
get it if he wants it.

Sherlock Holmes

Chapter 2 

COORDINATE INDEX DOCUMENT STORAGE AND RETRIEVAL SYSTEMS:
A FORMAL DESCRIPTION

2.1 DOCUMENT STORAGE AND RETRIEVAL SYSTEMS 

Document Retrieval Systems (DRSs) are a class of information
retrieval systems solely concerned with the subject analysis of document
content, the storage of a set of official surrogates "defining"
document content, and the "mechanical" search of the surrogate set to
identify or select those documents most "relevant" to a user's formal
request. The basic functions of a DRS are illustrated in Fig. 2.1.

Of special interest to this discussion are system output, user
inquiries, and index characteristics. Since each of these processes
and products is embedded in a system and is directly influenced by
other system components, a brief review of the major system functions
will be presented to place following developments in proper system
perspective.

2.2 DOCUMENT SELECTION: SIZING THE COLLECTION

Mention has already been made of the existing volume and growth of
documented information, and of the associated problems of researchers,
students, etc. concerned with keeping abreast of their fields of
interest.

It is elementary, however, to note that not all existing
information related to any one subject should be stored in DRSs serving
users in that field. It is equally evident that not all newly
generated documented literature is a contribution to that field, and
uncurbed storage of documents would result in unsatisfactory DRS
performance. From the point of view of the user, the quantity and quality
of system output would leave much to be desired. From the point of
view of the manager, the costs of indexing, analysis and searching would
be out of balance with the system's effectiveness. In order to manage
a document collection, selection criteria and document filtering are
necessary. Simply put, not all documents in a subject
field should be input into a DRS, and not all documents input in the
DRS should be stored forever.

With regard to the issue of document collection size, certain
models have been developed that can aid the DRS designer
and manager to estimate the number of documents or journals that should
be reviewed to yield a desired number of subject-relevant "documents,"
or conversely to estimate the number of "documents" that are generated
by a certain number of journals. The two models have been referred to
by Leimkuhler (88) as the Bradford Law of Scattering and the Bradford
Law of Distribution, and as one might suspect are inversely related.
Bradford (16) first stated the relationship of "documents" to journals,
as follows:

if a large collection of papers is ranked in order of
decreasing productivity of papers relevant to a given topic,
three zones can be marked off such that each zone
produces one-third of the total of relevant papers. The
first, the (sic) nuclear zone, contains a smaller number
of highly productive journals, say n_1; the second zone
contains a larger number of moderately productive
journals, say n_2; and the outer zone a still larger number
of journals of low productivity, say n_3. The Law of
Scatter states that

    n_1 : n_2 : n_3 = 1 : a : a^2

where a is a constant.

In the subject of geophysics, which Bradford analyzed, "a" was 
approximately equal to five. 
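
The zone arithmetic implied by the ratio 1 : a : a^2 can be sketched as
follows. The function name and the nucleus size of 10 journals are
hypothetical values chosen for illustration; only a = 5 comes from the
geophysics example above.

```python
def bradford_zones(n1, a):
    """Journal counts for the three zones, assuming n1 : n2 : n3 = 1 : a : a^2."""
    return n1, n1 * a, n1 * a ** 2


# Hypothetical nucleus of 10 journals, with a = 5 as in the geophysics case.
zones = bradford_zones(n1=10, a=5)
total_journals = sum(zones)

journals_so_far = 0
for zone_number, journals in enumerate(zones, start=1):
    journals_so_far += journals
    # Each zone is taken to contribute one-third of the relevant papers.
    print(
        f"zones 1-{zone_number}: {journals_so_far} journals "
        f"({journals_so_far / total_journals:.1%} of journals) yield "
        f"{zone_number}/3 of the relevant papers"
    )
```

Under these assumptions the nucleus holds only a few percent of the
journals yet supplies a third of the relevant papers, which is the
practical point of the law for journal selection.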

Subsequent to Bradford's effort, Vickery (141), Kendall (75),
Leimkuhler (88), Fairthorne (49) and Brookes (20) have each made
contributions to the interpretation and operationality of the Bradford Law
of Scatter. Leimkuhler (88) has shown the inverse relationship
between the Law of Scatter (the distribution of the number of journals
containing a given fraction of relevant documents) and the Law of
Distribution (the distribution of document productivity in a collection
of journals) and has expressed the latter in the following form:

    F(x) = \frac{\ln(1 + \beta x)}{\ln(1 + \beta)}

where F(x)  = the cumulative fraction of "documents" in a collection
              of journals on a specific subject
      x     = the corresponding fraction of the most productive
              journals in the collection; and 0 \le x \le 1
      \beta = a constant related to the subject field and
              the completeness of the journal collection.
The above model enables a DRS designer or manager to estimate the
relationship between the number of documents in the system corpus
and the number of documents in the population of journals on a
specific subject. In other words, the Bradford relationships can be
used* to relate the productivity of a collection of journals to the
population of journals, and aid in the selection of journals to yield
documents for the corpus. Given a subject field and budget constraints,
these relationships can aid in the cost/benefit tradeoff between
budget dollars and the number of documents/journals to collect.
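
As a rough illustration of how such a model could support journal
selection, the sketch below evaluates the logarithmic form usually
attributed to Leimkuhler, F(x) = ln(1 + bx) / ln(1 + b). Both the use of
that exact form here and the value b = 20 are assumptions made for
illustration; the constant would have to be fitted to a real subject
field and collection.

```python
import math


def leimkuhler_fraction(x, beta):
    """Cumulative fraction of documents held by the top fraction x of journals.

    Assumes the form F(x) = ln(1 + beta*x) / ln(1 + beta), with 0 <= x <= 1.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie between 0 and 1")
    if beta <= 0:
        raise ValueError("beta must be positive")
    return math.log(1 + beta * x) / math.log(1 + beta)


BETA = 20.0  # hypothetical constant, not a fitted value

for share in (0.1, 0.25, 0.5, 1.0):
    covered = leimkuhler_fraction(share, BETA)
    print(f"top {share:.0%} of journals -> {covered:.1%} of documents")
```

The concave shape is what matters to a manager with a fixed budget: the
most productive journals account for a disproportionate share of the
documents, so each additional journal collected yields less.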

2.3 INDEXING: DOCUMENT ANALYSIS AND REPRESENTATION

For this discussion, indexing will be defined as the assignment
of subject content indicating terms to a document. The purpose of
the indexing operation is to make it possible to search a file of the
content indicating terms, that are mapped onto the set of documents,
as a substitute for searching the document set, and to identify those
documents relevant to an inquiry. Relevant is used here to mean
that condition in which the terms used in the inquiry are also used
to describe the selected documents.

It is of course theoretically possible to review the set of
documents as opposed to the index file, but this approach quickly
becomes physically and economically impractical for even moderate
collections (several hundred) of documents. Thus the index provides a
manageable set of content indicating terms and classes to be searched
in place of the corpus, and provides a vehicle to identify those
documents in the corpus most likely to contain the desired information.

There is in fact a spectrum of indexing philosophies, and associated
techniques with various proper names. That they are all related
or relatable has been discussed by Artandi (3), Bourne (13), Jahoda (71),
and demonstrated by Foskett (52). Basic to any indexing process is the
set of vocabulary terms employed to describe the content of the
documents. The set of vocabulary terms constitutes the index language,
and as well, an important part of the inquiry language of DRSs. The
latter property follows from the fact that once the index terms have
been assigned to the set of documents, they are then used to
represent the documents and become the vehicle to map inquiries onto the
corpus.

*Groos (60) has observed a departure from the linear relationship
in log-log space of the Bradford Law when plotting the Keenan-Atherton
data for physics. The observed deviation, however, has not been
thoroughly evaluated to determine if the cause lies in the assumptions of
the Bradford Law or in the incompleteness of the experimental
observations.

Traditionally, subject classification concepts involve the use of
formal schemes to organize the subject matter in a predetermined order
to some prescribed depth of detail. Typically, these traditional
classifications are hierarchical in nature; that is, there exists among the
set of descriptors a rather precisely defined relationship of every
term to every other term. At the other end of the spectrum there are
the "key word" systems, which in their simplest form have no word
relationships defined, and usage of, and addition to, the descriptor
vocabulary is unrestricted. Artandi (3) makes a useful distinction
between "systems vocabulary" and "lead-in vocabulary" as a means of
distinguishing between word indexing and subject indexing. They
are both methods of representing document content, but they differ
operationally. By "systems vocabulary" is meant the set of terms
under which document content descriptor entries are made; that is, the
terms used to index the documents. The "lead-in vocabulary" of a DRS
"is an index referring from terms used in the literature to terms in
the system vocabulary" (3). The principal characteristic of word
indexing is that descriptors or words are employed as they are found
in the text of documents to serve as index terms. Thus word indices
are derived from the documents that are being indexed.

The key word in context (KWIC) index is an example of word
indexing, in its simplest form, involving elementary alphabetical
permutations of the "key words" in the document titles.

2.3.1 Coordinate Indexes 

Word indices in which the index terms are manipulated or coordinated*
are called coordinate index systems. Further, those DRSs in which
the coordination of the descriptors is done in the indexing process
are called pre-coordinate DRSs. Analogously, those systems in which
the coordination of the descriptors takes place during the inquiry
generation process are called post-coordinate DRSs. The pre- and post-
distinctions obviously refer to the temporal occurrence of the event
of combining descriptor terms.

The important characteristic of pre-coordinate DRSs is that the
searching occurs, and the inquiries are generated, using the terms and
their combinations that the indexer has prepared. There is no additional
coordination of the descriptors at the time of the inquiry.

Traditional examples of pre-coordinate systems are the hierarchical
systems in which a tree structure is employed to define a
generic-subordinate relationship and the coordinated relationships among the
subordinate terms. Figure 2.2 illustrates a typical hierarchical scheme,
and some examples are the Library of Congress, Dewey Decimal, and
Universal Decimal Classification Systems.

*As first developed, "coordinated terms" literally implied the
statistical conjunction of two or more terms. However, the meaning of
"coordinate index" as used in most post-coordinate index systems has
been broadened to incorporate the full set of Boolean operators, and
in some instances even syntactical, semantic and syndetic term
relationships.

Another class of pre-coordinated systems are facet indices. A
facet is a set of terms which occurs with sufficient frequency in a
subject field to provide a useful category or facet of terms for the
description of documents in that field. A schematic of a facet index
is given in Fig. 2.3. In these systems, the pre-coordination of the
descriptor terms occurs at the time the facet is defined. The concept
of faceted systems for subject description was first developed by
Ranganathan (110) in his colon classification scheme.

Although the above two classes of pre-coordinate index systems
exhibit strong structural properties, there are also pre-coordinate
systems which have no hierarchies or proper set structure. Such
systems essentially consist of a set of descriptors (the vocabulary), and
a set of indexing and vocabulary control rules.

Post-coordinate index schemes, as noted previously, are exemplified
by the combination of more or less elemental index terms at the time
of inquiry generation and search initiation. These systems are adaptive
in that they can accommodate shallow or deep indexing as well as simple
or complex inquiries. In their earliest form, post-coordinate
retrieval systems were known as Uniterm systems, after Taube (135). The
Uniterm is a unit or elemental concept, usually a single word, used
to describe the subject of a document. In many systems, the vocabulary
is quite often derived from the text and title of the documents to be
indexed, and no control is applied over the vocabulary or the
coordination of the descriptor terms. The post-coordinate index system is a
very versatile scheme and can be adapted to incorporate a broad set of
characteristics. Figure 2.4 illustrates a taxonomy of coordinate
retrieval systems, and various logical extensions to other types of
index systems. Of central relevance to this report are the
post-coordinate Document Retrieval Systems that incorporate Boolean
operators in the system language.

2.4 THE INDEX FILE 

The index file in a coordinate index system consists of the
descriptor/index vocabulary and the descriptor tracings or
assignments to the documents in the corpus. A sample of an actual index
vocabulary for the subject area of Information Science is given in
Fig. 2.5, and a sample of a term frequency of use ranking is presented
in Fig. 2.6.

Of particular interest are the following characteristics of a
coordinate index system file:

(1) the number of active terms in the vocabulary

(2) the frequency of use of each term

(3) the depth of indexing for the documents in the corpus

These characteristics are indicative of the term-document distribution
in the DRS, which is the basic relationship in these systems. It is
important to realize that all these characteristics are dynamic in
nature. They will change as new index terms are added, or created out
of combinations of existing terms, and as new documents are added
to, and old documents dropped from, the corpus. The index vocabulary
is used by the system user to generate, in a post-coordinate sense,
inquiries to the DRS.
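
The three file characteristics listed above can be computed directly from
the descriptor assignments. The sketch below is illustrative only: the
document identifiers, descriptor names and the function name are
hypothetical, and indexing depth is taken to mean the number of terms
assigned to a document.

```python
from collections import Counter

# Hypothetical index file: each document maps to its assigned descriptors.
index_file = {
    "D1": {"indexing", "retrieval"},
    "D2": {"retrieval", "boolean"},
    "D3": {"indexing", "retrieval", "thesaurus"},
}


def index_characteristics(assignments):
    """Return (number of active terms, term frequencies, depth per document)."""
    term_frequency = Counter(
        term for terms in assignments.values() for term in terms
    )
    depth = {doc: len(terms) for doc, terms in assignments.items()}
    return len(term_frequency), term_frequency, depth


active_terms, term_frequency, depth = index_characteristics(index_file)
print("active terms:", active_terms)
print("frequency of use ranking:", term_frequency.most_common())
print("indexing depth per document:", depth)
```

Re-running the same computation after documents or terms are added or
dropped is one simple way to observe the dynamic behavior described above.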
Fig. 2.5: Index vocabulary illustrations (from Maron (98))

Fig. 2.6: Index term list sorted on frequency of use (from Maron (98))
2.5 INQUIRY FORMULATIONS

The fundamental components of inquiry formulations are the user's
need for information, the system inquiry vocabulary (the index file),
and the system inquiry grammar.

The notion of user need for information is principally
psychological in nature; it is very dynamic and directly dependent on the
relative state of knowledge of the user. The reason for noting the user's
need at this point is primarily to identify the source of the DRS workload
or demand. The expression of a need for information, in the terms and
grammatical structure of the system, is the system inquiry. It is
usually the case that the formal inquiry is only a partially accurate
representation of the "real" need on the part of the user. However,
for the purposes of this analysis the formal inquiry will be taken as
the complete system workload, as the system output variable of interest
is quantity. The knotty issues of distinguishing between felt need,
expressed request and formal inquiry, and their respective "noise"
contribution to the relevance* and nonrelevance of systems output, are not
dealt with.

The fundamental components of the formal inquiry are the descriptor
terms incorporated in the inquiry, and the grammatical operators used
to "coordinate" the terms. The descriptor terms have been described,
and the grammar used in DRSs will be discussed next.

*There has been more analysis related to the concept of relevance
(its definitions, measurement and quantification) than any other
Information Retrieval System characteristic. To mention just a few, see
Cooper (33), Barhydt (5), Cuadra and Katter (36, 37, 38), Doyle (45),
Salton (112), Swets (131), Swanson (129), and Maron and Kuhns (97).
2.5.1 Inquiry Grammar

The operational manner in which the descriptor terms are coordinated
in an inquiry is defined by the system grammar. For the class of
Information Storage and Retrieval systems that this analysis deals with, the
nature of the grammar is quite primitive; only certain explicit
operations/connections are permitted, between system controlled vocabulary
terms, in the "coordination" process.

The formal representation of coordinate retrieval system grammars
can take several forms. A common representation is in terms of a
logical language, for example, a sentential or propositional calculus. In
this analysis, the rules of term combinations can be formally
represented by a Lattice Algebra,* or its less general proper subset, Boolean
Algebra.** In the rest of the discussion a Boolean Algebra structure
will be assumed. Essentially, Boolean Algebra is concerned with the
specification of the relationships between two classes of objects.
Very briefly, this structure, for a defined set T and its elements
(A, B, ...), is defined in terms of the following operations.

Conjunction: C = A·B, the subset or subclass of all index
             terms or elements of T that are in both
             subset A and subset B.

*Excellent presentations of Lattice theory are provided in Birkhoff
(9) and Szasz (134). Applications to DRS theory can be found in Becker
and Hayes (6) and Salton (117).

**A Boolean Algebra is defined as a distributive lattice in which
each element "a" has a complement defined by its negation.

Disjunction: D = A + B, the subset of all index terms or elements
             of T which are in either subset A or subset B.

Negation:*   N = -B or B̄, the subset of all index terms in T
             which are not in subset B.

Figure 2.7 illustrates many of the different symbolic and graphical
notations in use to represent the above logical operations. For this
analysis the notations "." for conjunction, "+" for disjunction and
"-" for negation will be employed consistently.

In sum, the inquiry language (grammar and vocabulary) is the 
vehicle to translate user's information needs into formal system in- 
quiries. Subsequent to the generation of the request, the next step 
is the search and retrieval process. 
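
The three operations can be illustrated with ordinary set arithmetic. In
the sketch below, each descriptor is represented by the set of document
numbers posted under it, which is an assumption made for illustration;
the descriptor names, postings and function names are hypothetical. The
size of the resulting set is the quantity of output for the inquiry.

```python
# Hypothetical postings: descriptor -> documents indexed under it.
postings = {
    "A": {1, 2, 3, 5},
    "B": {2, 3, 4},
    "C": {5, 6},
}
corpus = {1, 2, 3, 4, 5, 6}  # all document numbers in the collection


def conjunction(a, b):
    """A.B : items posted under both A and B."""
    return a & b


def disjunction(a, b):
    """A + B : items posted under either A or B."""
    return a | b


def negation(b, universe):
    """-B : items in the universe not posted under B."""
    return universe - b


# Inquiry "A . -C": documents indexed under A but not under C.
result = conjunction(postings["A"], negation(postings["C"], corpus))
print(sorted(result), "quantity output =", len(result))
```

Evaluating an inquiry this way requires a full search of the postings;
estimating the size of the result before the search is run is the
problem the Rq estimate addresses.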

2.6 SEARCH FILES AND RETRIEVAL PROCESS

A central DRS component is the storage or search file,** which
contains the descriptions of corpus documents. This file provides the
means whereby formal requests are compared with the index descriptions
of the documents. In a sense, there is an input indexing operation
(on the documents), and an output indexing operation (on the user's
request). Given that both requests and documents are represented by
lists of index terms, the retrieval process consists of matching the
two lists, and retrieving those documents whose descriptions sufficiently
overlap or match the inquiry.

*There are variations on the operation Negation that can be used
in DRSs; for example, Praeternegation: implicit exclusion instead of
explicit exclusion, Soergel (125); Brouwerian Complement: the smallest
set of items that with certainty contains all the NEGATED elements,
Salton (117); Pseudo Complement: the largest set of items that with
certainty contains no NEGATED elements, Salton (117).

**In actuality, most DRSs have two search files; one for the
document descriptor images, and one for the physical storage of the
documents. Only the former files are of concern here.

The assignment of index terms to documents can be represented in
matrix form. A hypothetical assignment of terms to documents is shown
in Figure 2.8. In this example, the index terms are represented by
the set T, and the set of documents by D, where T and D also denote
the power (cardinality) of the respective finite sets and are usually
not equal. As indicated in the example, the term-to-document
assignment is a binary operation, represented by the blank or zero
and 1 notation, the latter representing assignment. While other
assignment operations are possible, notably weighted assignments, the
more common index operations are binary, and will be the type assumed
in this analysis.

The search file can be represented in matrix (DXT) form, with the
columns constituting the index term profile of the corpus, and the rows
representing the membership of documents in the index term or concept
sets. There are two basic arrangements for the search file: index terms
on documents (TXD), or the inverted file, shown in Figure 2.9, and
documents on terms (DXT), as shown in Figure 2.8. The DXT arrangement
is the usual output from the indexing operation, and the TXD (the
transpose of DXT) is the more convenient form for searching and
retrieving documents. The retrieval process consists of a subject
search of the document descriptions. Several simple cases of subject
searches are illustrated in Figure 2.9. Search request (1) is a simple
one-descriptor inquiry, which would retrieve three documents. For this
kind of search request, those documents that belong to the index sets
defined in the inquiry are retrieved, regardless of whether other index
descriptors are also assigned to the specific documents.

Fig. 2.8 - DXT matrix: assignment of terms to documents
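
The relationship between the two file arrangements can be sketched in a few lines. The matrix below is hypothetical (it is not the assignment of Figure 2.8), and the names `dxt`, `terms` and `invert` are introduced here only for illustration. The sketch transposes a binary DXT assignment into an inverted (TXD) file and runs a single-descriptor inquiry against it.

```python
# Hypothetical binary DXT assignment: one row per document, one column per term.
terms = ["t1", "t2", "t3", "t4"]
dxt = {
    "d1": [1, 0, 1, 0],
    "d2": [1, 1, 0, 0],
    "d3": [0, 1, 1, 1],
    "d4": [1, 0, 0, 1],
}


def invert(dxt, terms):
    """Transpose DXT into TXD: each term maps to the set of documents it indexes."""
    txd = {term: set() for term in terms}
    for doc, row in dxt.items():
        for term, assigned in zip(terms, row):
            if assigned:
                txd[term].add(doc)
    return txd


txd = invert(dxt, terms)

# A one-descriptor inquiry is a direct look-up in the inverted file; documents
# are returned regardless of which other descriptors they also carry.
single_term_hits = txd["t1"]
```

With the inverted arrangement a single-term search costs one look-up, whereas the DXT arrangement would require a scan of every document row.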

There are different retrieval strategies that can be used in
coordinate index DRSs to select inquiry "relevant" documents. The two
major strategies to be considered are direct match and word association
retrieval. The simplest direct match request is the single term
inquiry, already noted. The next and more common request is the
conjunctive coordination of two or more terms. These logical product
inquiries require that the documents retrieved have all the inquiry
terms assigned as subject descriptors, and the search result is defined
as an exclusive mapping on the search file. That is, only those
documents dealing with the inquiry "exclusively" are retrieved.
Figure 2.10 illustrates an exclusive search by logical statements and
Venn diagrams.

A less restrictive direct match request is to disjunctively
coordinate descriptors as a logical sum. In this type of inquiry each
term is treated as a logical equivalent or synonym of every other term,
and any document description containing one or more of the terms is
retrieved. These logical sum inquiries result in an inclusive mapping
on the search file. For the same set of inquiry terms, the inclusive
search output will contain the exclusive search set. An illustration of
an inclusive search logic is given in Figure 2.10. In general,
inquiries will contain combinations of logical products and sums of
index terms, and occasionally, negation of a term. Term Negation is
treated in this analysis as the complement of the logical product
operation.
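
The direct match strategies described above can be sketched against a small inverted file. The postings below are hypothetical, and the function names are introduced here for illustration only. Negation is implemented as the text treats it, that is, as the complement of the logical product: the documents of one term less those it shares with the negated term.

```python
# Hypothetical inverted file (TXD): each term maps to the documents it indexes.
txd = {
    "t1": {"d1", "d2", "d4"},
    "t2": {"d2", "d3"},
    "t3": {"d1", "d3"},
}


def exclusive_search(txd, inquiry_terms):
    """Logical product: documents indexed by every inquiry term."""
    return set.intersection(*[txd[t] for t in inquiry_terms])


def inclusive_search(txd, inquiry_terms):
    """Logical sum: documents indexed by at least one inquiry term."""
    return set.union(*[txd[t] for t in inquiry_terms])


def product_with_negation(txd, keep, negated):
    """keep . (-negated): documents under `keep` less the product keep . negated."""
    return txd[keep] - (txd[keep] & txd[negated])


both = exclusive_search(txd, ["t1", "t2"])
either = inclusive_search(txd, ["t1", "t2"])
t1_not_t2 = product_with_negation(txd, "t1", "t2")
```

For any fixed set of inquiry terms the exclusive result is a subset of the inclusive result, which is the containment noted in the text.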

The second retrieval strategy is word association searching, in
which the initial inquiry is expanded or broadened so as to retrieve
more documents in the corpus that are "relevant" to the initial
inquiry. Association retrieval techniques are based on the
relationships between descriptor terms assigned to the DRS corpus.
There are basically four categories of word relationships that can be
used as a basis for inquiry term augmentation: (1) semantic
relationships, which manifest the meaning and context of terms within a
language; (2) syntactic relationships, which arise from terms as
members of word classes and from the class relationships in a
structural (grammatical) sense; (3) syndetic relationships, which
measure the manner by which words that are conjunctively coordinated
with a given or base term cross-reference one another; and (4)
statistical relationships, which measure the frequency of occurrence
of terms in a document.
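
Of the four categories, the statistical one is the easiest to make concrete. The sketch below is an assumption-laden illustration and not the operational definition developed in later chapters: the association measure (shared documents divided by documents carrying either term) and the threshold are both chosen here for convenience, and the postings are hypothetical.

```python
# Hypothetical postings: each term maps to the documents it has been assigned to.
txd = {
    "A": {"d1", "d2", "d3"},
    "B": {"d2", "d3", "d4"},
    "C": {"d5"},
}


def association(txd, a, b):
    """Assumed measure: co-occurrences divided by documents carrying either term."""
    either = txd[a] | txd[b]
    return len(txd[a] & txd[b]) / len(either) if either else 0.0


def expand(txd, term, threshold):
    """Return the term plus every other term associated with it at or above threshold."""
    associated = {
        other for other in txd
        if other != term and association(txd, term, other) >= threshold
    }
    return {term} | associated


expanded_terms = expand(txd, "A", threshold=0.5)
# The expanded inquiry is a disjunction (logical sum) of the original and added terms.
retrieved = set.union(*[txd[t] for t in expanded_terms])
```

In this toy corpus the inquiry on term A alone retrieves three documents, while the expanded inquiry retrieves four; the additional document is the intended effect of the broadening.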

For this analysis, only the statistical association will be
discussed, in that it is the most common technique for inquiry
modification. The emphasis (in later chapters) will be on its
operational definition. As implied by the name, statistical term
association does not address the semantic, syntactic or syndetic
connections among terms; rather, it views terms as separate isolatable
units and is based principally on the frequency of term usage within a
given DRS corpus. The basic assumption is that, within the context of a
given corpus, terms which are statistically correlated with one another
are presumed to be meaningfully associated. Hence the implication is
that if terms A and B were determined to be associated, for a given
corpus, and term A appears in an inquiry, that inquiry could be
expanded by the disjunctive incorporation of term B with term A. The
objective of including term B is to increase the likelihood of
retrieving a larger set of inquiry "relevant" documents from the
corpus.

2.7 DRS - A BRIEF FORMAL DESCRIPTION

The above discussions have been basically informal, and it is
instructive to consider what a formal statement of a DRS consists of.
The advantages of a formal statement are: (1) that the elemental or
basic components of the system and their relationships are defined, so
as to provide a sound basis for intra-system analysis, and (2) that
inter-system structural and operational comparisons are facilitated.

Formally,* a coordinate index DRS is defined as consisting of:

1. A set of distinct documents to be analyzed/indexed

       D = {d_1, ..., d_D}

2. A set of elementary descriptors/attributes/index terms from
   which compound descriptors (combinations) can be constructed

       T  = {t_1, ..., t_T}    the elemental set of attributes

       T' = {t_1, ..., t_j}    the set of terms generic to set T
                               and composed of combinations of
                               elements in T

3. A set of statements/axioms which connect descriptors with
   documents. This set of statements defines a homomorphic mapping
   between the set of descriptors T and the set of documents D.
   The mapping usually results in a binary set of assignments,

       {T} -> {D} : DXT (binary)

   but it is not necessarily restricted to 0 or 1 assignments;
   weighted assignments are also possible.

4. A set of statements (theorems) derived from the axioms and
   the system grammar which define the manner of coordination
   and relationship of descriptors for searching and inquiry
   specifications.

*For extensive definitions of formal systems see Curry, et al. (39),
and for an excellent discussion of a formal system definition of DRSs
see Soergel (125).
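
The four components can be mirrored in a small data structure. This is a sketch under stated assumptions rather than a formalization: the class and method names are hypothetical, only binary assignments are modeled, and the single search rule shown (conjunctive coordination) stands in for the fuller set of theorems in item 4.

```python
from dataclasses import dataclass, field


@dataclass
class CoordinateIndexDRS:
    documents: set                                   # item 1: the set D
    terms: set                                       # item 2: the elemental set T
    assignment: dict = field(default_factory=dict)   # item 3: term -> set of documents

    def assign(self, term, doc):
        """Record a binary assignment of an index term to a document."""
        if term not in self.terms:
            raise ValueError(f"unknown term: {term}")
        if doc not in self.documents:
            raise ValueError(f"unknown document: {doc}")
        self.assignment.setdefault(term, set()).add(doc)

    def search_all(self, inquiry_terms):
        """Item 4 (one rule only): conjunctive coordination of the inquiry terms."""
        postings = [self.assignment.get(t, set()) for t in inquiry_terms]
        return set.intersection(*postings) if postings else set()


drs = CoordinateIndexDRS(documents={"d1", "d2"}, terms={"t1", "t2"})
drs.assign("t1", "d1")
drs.assign("t1", "d2")
drs.assign("t2", "d2")
hits = drs.search_all(["t1", "t2"])
```

Weighted assignments, which the text notes are also possible, would replace the set of documents per term with a mapping from document to weight.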

2.8 RETRIEVAL SET CHARACTERISTICS 

It follows from the preceding discussions that the properties of
the retrieval set are a function of three parameters:

(1) the number of terms and the degree and type of coordination
    in the inquiry

(2) the search strategy: either direct match or word association

(3) the DRS DXT distribution, from which all the DRS
    characteristics can be derived.

The retrieval set characteristics are definable in terms of quantity
and quality. The quality measure is a reflection of the user's
judgment of the relevance of the retrieved material. The quantity
measure is simply the number of documents output in response to the
inquiry, and is the retrieval set characteristic of interest to this
discussion.

The principal task is to define the quantity output as a function
of the above noted parameters: inquiry, search strategy and the DXT
distribution. Various hypotheses about the functional relationship and
the parameters will be presented and analyzed in Chapters 4 and 5.
However, before addressing those issues, a statement of how retrieval
quantity is related to existing DRS performance measures is necessary
to provide additional perspective for the measure as a management and
design tool.

Performance measures like sign posts guide the wi^....

Chapter 3 

RETRIEVAL QUANTITY AND DRS PERFORMANCE MEASURES

3.1 INTRODUCTION 

In this section, the need for a Retrieval Quantity (Rq) measure
will be discussed, and the relationship of the proposed measure to
other DRS performance measures will be noted.

The tasks of design and management of DRSs require tools and
performance measures to aid in the selection of preferred candidate
options, and in the control over the fundamental processes of inquiry
analysis, indexing, retrieval and system output. The designer needs
tools that reflect the cause-effect relationships between the DRS
building blocks of thesaurus, corpus and term-document distribution.
Before a DRS is built, the design should be assessed and compared to
alternative designs. Existing DRSs require management tools to tune the
system to meet the needs of the user, and to control the changes in the
system due to growth in the thesaurus and corpus. Users of DRSs need
guidelines to construct and adjust inquiries to more completely meet
their information needs, both in quantity and quality.

Some of the tools and performance measures are available, and a
basis for an overall analytic framework also exists, although a
rigorous systems formulation has yet to be developed. A brief survey of
a number of the measures that can be used for design and management
will be presented next, and the relationship of Rq to the different
measures briefly noted.

3.2 MEASURES FOR EVALUATION

The primary purpose of a DRS is to provide the system users, cost-
effectively and over time, with the information requested when it is
needed. The major dimensions of evaluation implied in this objective
statement include: time, cost, flexibility, convenience of use,
information quality, and information quantity.

3.2.1 Response Time 

In general, for information systems, the dimension of time reflects
the period to perform an operation such as providing the user with a
response to an inquiry.* Lowe (92) and Hayes (63) have investigated
various time processing distinctions between different file
organizations for storage and retrieval operations. Also, it follows
that the amount of time to process an inquiry will be proportional to
the thesaurus size and term frequency of use distribution. In fact,
Webster (145) has demonstrated that certain DRS dictionary searching
techniques are critically affected by the term frequency of use
distribution. In many DRSs the requests are batch processed, and from
the user's point of view the response time is fixed. However, the
amount of "processing" time is still of interest to the system manager.
In those systems in which there is an on-line real-time environment,
the user, by necessity, also becomes acutely aware of processing times.

One possible way to anticipate required inquiry processing time
is to use the inquiry as a basis for estimating the required search and
retrieval operations. The procedure for predicting the retrieval
quantity measure (to be presented in Chapter 4) entails a set of
iterative steps proportionate to the "complexity" of the inquiry.
Assuming a balanced file and dictionary look-up scheme in which each
step takes approximately the same amount of time to process, by
estimating the retrieval quantity and keeping track of the number of
iterations required, the user and manager could gauge the inquiry
processing time and workload demands, respectively.

*A more restrictive definition of response time is offered by
Lancaster and Climenson (84), who define it as the average time
required to obtain a satisfactory response from the system.

3.2.2 System Costs 

Various recommendations have been made for measures of cost-
effectiveness for information storage and retrieval systems. Overmeyer
(105) has published a relatively detailed cost analysis of the American
Society of Metals System of Western Reserve University. Lancaster (85)
discusses relevant system factors susceptible to cost analysis, and
suggests possible tradeoffs between input and output costs and between
alternative candidate DRSs. Tell (137), Kochen (78), Bryant (23),
Westat (147) and Lancaster (84) have developed DRS cost-analysis
models of various degrees of detail. Notwithstanding these efforts, a
comprehensive operational model for costing still remains to be
developed. A sound basis for DRS cost analysis appears to exist; for
example, Lancaster (85) provides a subject relevant framework that
could be coupled with the concept of opportunity costs and a well-
developed system analysis setting, as in Fisher (51). It appears,
however, that standard cost accounting methods cannot be conveniently
or correctly carried over to DRS operations. As Marron (99) notes,
a corpus of documents is not really like or analogous to equipment or
machinery, particularly with regard to the concept of depreciation or
amortization. Also, the costs and effort of constructing a corpus are
not very sensitive to the demand volume for services. As well, the
problem of correctly tracing input and operational costs is
particularly difficult when there are several information services
performed by the system; for example, dissemination, retrieval,
abstracting, etc. Also, most DRSs operate in a non-market setting in
which the users of the system do not "pay" for the service, and the
system does not "compete" to provide the service. This situation tends
to complicate the costing of resources consumed and the estimation of
benefits accrued.

To some degree, the retrieval quantity estimate can aid in the
costing of inquiries: the inquiry processing time estimate noted above
is multiplied by a cost per unit processing time. Also, the cost
estimate per inquiry can help the user "balance" his needs with the
probable system accrued costs.
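
The arithmetic implied here is small enough to state directly. The sketch below assumes, as Section 3.2.1 does, that every iteration of the estimation procedure takes about the same time; the function name and the figures in the example are hypothetical.

```python
def inquiry_time_and_cost(iterations, seconds_per_step, cost_per_second):
    """Estimate processing time, then cost, for one inquiry.

    Assumes equal-duration steps (a balanced file and look-up scheme).
    """
    processing_time = iterations * seconds_per_step
    return processing_time, processing_time * cost_per_second


# Hypothetical figures: 12 iterations, half a second each, 2 cents per second.
time_estimate, cost_estimate = inquiry_time_and_cost(12, 0.5, 0.02)
```

A user could compare such an estimate for two candidate formulations of the same inquiry before submitting either.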

3.2.3 System Convenience of Use 

The principal issue in the dimension of convenience of use is the
amount of effort that is required from the system user to interact with
the DRS. To some degree the literature on man-machine interaction has
some bearing. Certainly the notion of unburdening is relevant.
Investigations by Saracevic (120), Lancaster (82, 83), and Lesk and
Salton (86) indicate that there is a need for interaction between the
user and the search analyst, but there is no consensus as to whether
the interaction should take place before the search or after the
retrieval. There is no convenience-of-use measure of what is efficient
user-system interaction. Clearly, a fundamental parameter is the state
of the user's need for information. Martyn and Vickery (100) discuss a
number of conditions affecting user need, and Voigt (142) has prepared
an early (1959) but still accurate description of user needs for
information. It would seem that, given a communicable information need,
a retrieval quantity estimation process can aid in tuning the user's
inquiry to the expected/desired size of the response. This notion is
discussed further in Chapter 6.

3.2.4 System Flexibility 

Flexibility is meant to be a measure of the DRS's capacity for
positive adaptation. An implemented DRS can only stay successfully
operational if it is adaptive. Of interest for this measure is what
systems have to be adaptive for, and in what ways this flexibility can
be built into the system structure. Ironically, most DRSs are
justified on the basis of the rapid growth and rate of change of
relevant literature, and yet the systems are designed for the point in
time when they are implemented, with little regard given to the need
for flexibility to accommodate system growth. In addition to the
inherent growth of the corpus and thesaurus, DRSs should also have a
certain flexibility to adapt to changing user needs and behavior. One
of the greatest faults of the traditional library classification
schemes is the implicit assumption that all library users are
counterpart mini-models of the classification scheme, and as well will
never change. A more preferred state is one in which a DRS would
interact with users at different levels of user proficiency, and grow
in a controlled sense with the incorporation of new material.

An important impact of growth is that as the corpus and thesaurus
change, the system output will be different at different points in
time for the same inquiry. The retrieval quantity measure can be used
to gauge the impact of corpus and thesaurus growth on the DRS output
quantity, and in this dimension provides a measure of system
adaptability. This application of the retrieval quantity estimate is
discussed in Chapter 6.

3.2.5 Retrieval Quality 

Measures of retrieval quality have by far received the most
attention of the DRS dimensions of evaluation. By retrieval quality is
meant the relevance, pertinence or correctness of the retrieved
document information to the user's information need.

For any document corpus, only a fraction of the collection will
contain relevant information regarding a specific user inquiry. For
example, if there are D documents in the corpus, then only R may be
relevant to the particular inquiry. Without the entire set D being
retrieved, it is unlikely that all R relevant documents will be
retrieved in any one search initiated by the inquiry. Usually, only a
fraction H of the R relevant documents are retrieved, and by definition
M = R-H will be missed. Also, it is usually the case that a number I of
irrelevant documents will be retrieved by the system in response to the
inquiry. Following Vickery (140), these characteristics of a DRS and
the retrieval set are represented in a two-by-two contingency table as
shown in Fig. 3.1. For this binary construction, all the D documents
in the system are accounted for, with respect to the inquiry which
generated the retrieval set. Namely,

    a documents are good hits because a ⊂ R and a ⊂ H
    c documents are bad misses because c ⊂ R and c ⊄ H

presuming, of course, in this simple system that it is desirable to
retrieve all R relevant documents. Also included in the retrieved set H
are:

    b documents which are bad hits because b ⊂ I and b ⊂ H

and the remaining,

    d documents are good misses because d ⊂ I and d ⊄ H.

                     Relevant          Not Relevant
                +-----------------+-----------------+
    Retrieved   |        a        |        b        |  a + b = H
                |   (Good Hits)   |   (Bad Hits)    |
                +-----------------+-----------------+
    Not         |        c        |        d        |  c + d = M
    Retrieved   |  (Bad Misses)   |  (Good Misses)  |
                +-----------------+-----------------+
                    a + c = R         b + d = I          D

Fig. 3.1 - Two x two contingency table of an inquiry response (140)

From the two-by-two contingency table in Fig. 3.1, a plethora of 
retrieval efficiency measures, primarily directed at assessing rele- 
vance/quality, have been derived. Table 3.1 lists a sample of the 
derivable measures. Fundamental to all of these measures are two vari- 
ables — a relevance judgment and the quantity of documents (relevant 
and/or irrelevant) output. The close relationship between output in- 
formation quality and quantity in these measures is clearly evident. 
A predominant characteristic of these measures is that they are all 
designed to be computed after the retrieval operation, and consequently 
are of limited use to predict output or the effect of a system change. 
The retrieval quantity estimate is a step in the direction of develop- 
ing management tools for predicting retrieval output and impacts due 
to system change. 
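
Several of the ratios that follow from the cells of Fig. 3.1 can be computed directly once a relevance judgment has been made. The sketch below uses the commonly cited definitions of these measures, which are assumed here to correspond to the entries of Table 3.1; the function name and the cell counts in the example are hypothetical, and all marginal totals are assumed to be nonzero.

```python
def retrieval_measures(a, b, c, d):
    """Ratios derived from the two-by-two contingency table of an inquiry response.

    a: relevant and retrieved      b: not relevant but retrieved
    c: relevant but not retrieved  d: not relevant and not retrieved
    """
    H = a + b          # retrieved
    M = c + d          # not retrieved
    R = a + c          # relevant
    I = b + d          # not relevant
    D = H + M          # whole corpus
    return {
        "resolution": H / D,     # share of the corpus retrieved
        "elimination": M / D,    # share of the corpus not retrieved
        "pertinency": a / H,     # relevant share of the retrieved set
        "recall": a / R,         # retrieved share of the relevant set
        "omission": c / R,       # missed share of the relevant set
        "generality": R / D,     # relevant share of the corpus
        "fallout": b / I,        # retrieved share of the irrelevant set
        "specificity": d / I,    # rejected share of the irrelevant set
    }


# Hypothetical response: 20 documents retrieved from a corpus of 100.
measures = retrieval_measures(a=8, b=12, c=2, d=78)
```

Every entry requires the relevance split between a and b (or c and d), which is why these measures can only be computed after retrieval; the quantity H = a + b is the only component that can be addressed before any relevance judgment is available.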

A review of previous attempts to construct a Retrieval Quantity
estimate, and the suggested methodology to predict Rq developed in this
analysis, are presented in the next chapter.

                              Table 3.1

                       RETRIEVAL SET MEASURES

    Measure                           Equation (based on Figure 3.1)
    --------------------------------  ------------------------------
    Resolution factor (106)           (a+b)/D

    Elimination factor (106)          (c+d)/D

    Pertinency factor (106)           a/H
    (Relevance measure)

    Noise factor (106)                b/R

    Recall factor (106)               a/R

    Omission factor (106)             c/R  (Type I error)

    Generality ratio (31, 32)         R/D
    Concentration ratio (47)

    Fall out (69)                     b/I

    Specificity (113)                 d/I

    Distillation factor (47)          (ad-bc) / [(a+b)(c+d)]

    Discrimination factor (47)        (ad-bc) / [(a+c)(b+d)]

    False acceptance (101)            [not legible in the source scan]
                                      (Type II error)

Chapter 4

RETRIEVAL QUANTITY ESTIMATION: LITERATURE REVIEW 
AND PROPOSED METHODOLOGY 

4.1 INTRODUCTION 

The main body of this chapter is concerned with a review of past 
work related to retrieval quantity estimation. The second part of this 
chapter describes the proposed methodology for prediction of output 
quantity. 

4.2 GENERAL CRITIQUE OF PREVIOUS RESEARCH 

Surprisingly there have not been many analyses of the output quan- 
tity of DRSs; the review that follows is quite exhaustive. Though vari- 
ous approaches have been employed, all the research to date on the 
determination of retrieval quantity has, either implicitly or explicitly,
been based on the assumption that index terms are used as though they 
are independent of one another. The general lack of qualification or 
modification of this assumption has been the rather pervasive Achilles' 
heel of the efforts to date. This is so because index terms do not 
occur or co-occur as though they are independent of one another. To 
assume that they do exhibit independence causes large divergences be- 
tween actual and "theoretical" values of term co-occurrence and output 
quantity. 
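
The size of the divergence is easy to demonstrate on a toy corpus. In the sketch below the postings and corpus size are hypothetical; the "theoretical" value is the co-occurrence count that would be expected if the two terms were assigned independently, and the "actual" value is a direct count from the postings.

```python
def expected_cooccurrence_independent(D, n_a, n_b):
    """Expected co-occurrences of two terms in D documents under independence."""
    return D * (n_a / D) * (n_b / D)


# Hypothetical corpus of 10 documents in which terms a and b tend to travel together.
D = 10
postings_a = {"d1", "d2", "d3", "d4"}
postings_b = {"d1", "d2", "d3", "d5"}

theoretical = expected_cooccurrence_independent(D, len(postings_a), len(postings_b))
actual = len(postings_a & postings_b)
```

Here the independence assumption predicts fewer than two co-occurrences where three are observed; with related subject terms in a real corpus, underestimates of this kind compound as more terms are coordinated.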

The earliest attempt to estimate retrieval quantity appears to have
been by Bernier (7), in which the following argument is made. For a
system of D documents, T descriptors, a uniform depth of indexing of t
descriptors per document, with no two documents possessing an
identical set of descriptors, and indexing being an "essentially-
random" assignment process, an n-term conjunctively coordinated inquiry
has the following probability of retrieving at least one document:

    P(R_q \geq 1) = D (t/T)^{n}

This model is quite "hypothetic" because of its very restrictive
assumptions, which limit its usefulness. First, terms are not assigned
as though they are balls being selected randomly from an urn; secondly,
the depth of indexing distribution of DRSs is anything but uniform;
and thirdly, for systems of even moderate size the above probabilities
are so small as to provide almost no insight into the retrieval process.
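
The third objection can be checked numerically. The sketch assumes that the expression above reads D(t/T)^n, which is a reading of a partly illegible equation in the source; the function name and the system sizes in the example are hypothetical.

```python
def bernier_estimate(D, T, t, n):
    """D * (t/T)**n for an n-term conjunctive inquiry (assumed reading of the model).

    D: documents, T: descriptors, t: descriptors per document, n: inquiry terms.
    """
    return D * (t / T) ** n


# Hypothetical moderate system: 10,000 documents, 1,000 descriptors, depth 10.
two_terms = bernier_estimate(D=10_000, T=1_000, t=10, n=2)
three_terms = bernier_estimate(D=10_000, T=1_000, t=10, n=3)
```

Each added term multiplies the estimate by t/T, so under these figures a three-term inquiry is already predicted to retrieve a document only about once in a hundred attempts.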

A more ambitious attempt was made by A. D. Little (1, 2), in which
a model to predict the average number of documents to be retrieved for
a given inquiry is constructed. The expected number of documents
retrieved is defined as a function of:

(1) the number of terms coordinated in the inquiry (only
    conjunctive inquiries were used): n

(2) the number of documents in the corpus: D

(3) the average depth of indexing: q

(4) the frequency of use distribution of index terms, which
    is approximated by a geometric series and incorporated in
    the function by a factor (1-β)/2, β < 1

(5) the term usage distribution for users generating
    inquiries: S

(6) the index term correlation for indexing documents
    (the assumption of independent term usage with a
    correction factor was employed): S

(7) the effect of a system "requestor" to aid in the specification
    of search inquiries (an implicit factor)

with the resulting function:

    [equation not legible in the source scan]

This model, though containing many system and inquiry
characteristics, does not perform very well at all, as shown in
Fig. 4.1. The assumption of term-term independence is the principal
factor. Also, the assumed distribution for the selection of inquiry
terms, while not an essential ingredient for determining retrieval
quantity, is not necessarily the same distribution as for terms used to
describe documents.

Fig. 4.1 - Comparison of actual number of documents retrieved
           with theoretical number based on assumption of term-
           term independency (from Ref. 2)
           [figure not reproduced; surviving axis label: Theoretical]

A more abstract approach is suggested by Switzer (133), who
employed a term-term distance measure to estimate the elements of the
term correlation matrix (TXT). Switzer does not estimate the expected
number of documents to be retrieved for an inquiry, but does note that
once the term-term couplets are estimated, the logical extension to
evaluating term combinations in inquiries is possible. The principal
assumptions in this analysis are:

(1) the normalized co-occurrences are considered to be
    probabilities (a frequency interpretation of probability is
    implied)

(2) the term co-occurrences are hypergeometrically distributed.

The proposed relationship for the value of the couplet of terms a and
b is:

    [equation not legible in the source scan]

    for n > 2.

which is the hypergeometric distribution with parameters

    N_ab = the number of co-occurrences for terms a and b
    D    = the number of documents in the corpus
    N_a  = the number of times term a has been used
    N_b  = the number of times term b has been used

Switzer did not empirically test this relationship, but it is
clear from the fundamental assumption of hypergeometricity, which is
random sampling from a finite population without replacement, that it
is not correct. As noted previously, term-term co-occurrences do not
occur as though they are the result of a random sample.
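
What the hypergeometric assumption implies can be made explicit. The sketch below uses the standard hypergeometric probability mass function, not Switzer's own expression (which is not legible in the source); the parameter names follow the list above, and the figures in the example are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, D, n_a, n_b):
    """P(exactly k co-occurrences) if the n_b documents carrying term b were a
    random sample, without replacement, from D documents of which n_a carry term a."""
    if k < max(0, n_a + n_b - D) or k > min(n_a, n_b):
        return 0.0
    return comb(n_a, k) * comb(D - n_a, n_b - k) / comb(D, n_b)


def hypergeom_mean(D, n_a, n_b):
    """Expected number of co-occurrences under the same random-sampling assumption."""
    return n_a * n_b / D


# Hypothetical corpus: 20 documents, term a used 5 times, term b used 4 times.
D, n_a, n_b = 20, 5, 4
pmf = [hypergeom_pmf(k, D, n_a, n_b) for k in range(min(n_a, n_b) + 1)]
mean = hypergeom_mean(D, n_a, n_b)
```

The mean n_a * n_b / D is the same value produced by the independence assumption, which is why the hypergeometric model inherits the same underestimate of co-occurrence for subject-related terms.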

One of the more interesting formulations to estimate document
output is presented by Raver (111), in which the term frequency of use
distribution is approximated by a normalized log function. The explicit
distinction between a normalized and unnormalized term frequency of use
distribution is very useful. In addition, Raver notes that all Boolean
combinations of terms are reducible/definable by the "and" and "or"
operators with those terms.

The logarithmic relationship between the frequency of term use and
the term rank (in which the term with greatest use is given rank 1,
the next most used term rank 2, and so on) is of the form:

    [equation not legible in the source scan]

for a normalized distribution,

where N' = most frequently used descriptor (normalized)
      T  = total number of active descriptor terms out of a
           thesaurus of size t
      r  = rank of the term; 0 ≤ r ≤ T, and is defined by

           r = 0 when f = N        for unnormalized distributions
                      f = N/f_min  normalized; N' = N/f_min

           r = T when f = 1        for normalized distributions
                      f = f_min    for unnormalized distributions

where f_min is the frequency of use of the least used term in the
active subset of the thesaurus.

Obviously, in those systems in which f_min = 1, the term frequency
of use distribution is automatically normalized. An illustration of
the normalized term frequency of use distribution is given in Fig. 4.2.

From the above relationship, Raver then shows that

(a) the average number of documents per descriptor is:

    [equation not legible in the source scan]

(b) the average number of descriptors per document is:

    [equation not legible in the source scan]

(c) the average number of documents to be retrieved for an n
    term (conjunctive) inquiry is:

    [equation not legible in the source scan]

The last relationship given is also extended to disjunctive
combinations of terms by noting that, "for r total documents equal to
the sum of all 'or' terms, the expected number of different documents
out of J documents will be,"

    r_adjusted = [right-hand side not legible in the source scan]

In going through the above derivations it becomes clear that terms
are assumed to be independently assigned, and that term co-occurrences
are independently distributed. Consequently the Raver estimations do
diverge greatly from actuality. However, this derivation is unique in
that the term frequency of use distribution is, albeit implicitly,
assumed to be of some standard form represented by a stable class of
functions, in this case the log function. This notion, as well as
that of the need to explicitly normalize the term frequency of use
versus rank distributions, is used in the proposed methodology in this
report.
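
The adjustment for disjunctive inquiries addresses double counting: summing the postings of the "or" terms counts a document once for every term it carries. Raver's own expression is not legible in the source, so the sketch below substitutes the standard occupancy result, which makes the same independence assumption the text criticizes: if r postings fall independently and uniformly on J documents, the expected number of distinct documents is J(1 - (1 - 1/J)^r). Whether this matches Raver's formula exactly is an assumption.

```python
def expected_distinct(J, r):
    """Expected number of different documents when r postings fall independently
    and uniformly on J documents (standard occupancy result, assumed here)."""
    return J * (1 - (1 - 1 / J) ** r)


# Hypothetical disjunction whose terms have 50 postings in total, corpus of 100.
adjusted = expected_distinct(J=100, r=50)
```

The adjusted figure is always below the raw sum r once r exceeds one and can never exceed J, which are the two properties any such correction must have.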

A different perspective is taken by King and Bryant (23), who deal
with the issue of quantity output in the context of an overall system
evaluation scheme, in which the relative frequency of indexing
consistency (aggregated over the thesaurus, indexers and corpus) at a
point in time is determinable. As such, the expected number of
documents is simply the number of documents in the file that are
relevant to the inquiry. That is, if a K term inquiry were submitted
with the conjunctive requirement that the retrieved documents be
described by all K terms, then the expected number of documents
retrieved would be:

    [equation not legible in the source scan]

where p1 = the portion of the corpus that "should" contain
           [illegible] of the K terms in the inquiry
      p2 = the relative frequency of a document being indexed when
           it should not be (a Type II error)
      p3 = the relative frequency of a document being indexed
           when it should be indexed by the inquiry terms
      D  = the number of documents in the corpus.

The above analysis is basically dependent on the assumption of
independence of indexing errors. That is to say, if the assignment of
terms to documents is sufficiently consistent, a norm can be observed
about which statistical fluctuations will sum to zero, provided the
indexing errors are indeed independently distributed. A second
assumption is that there exists the ability to determine the fraction
of the corpus that should contain (be indexed by) the terms in the
request. Neither of these assumptions seems operationally practical,
and together they are a rather awkward basis for determining Rq.
Clearly, one of the desirable attributes of an operational estimation
process is that it not require unwieldy computations, or data not
readily available.

Another somewhat different scheme, which indirectly addresses the
issue of quantity output, is investigated by Shumway (113). This
procedure involves an estimate of the total number of relevant
documents in a corpus, through the use of sampling techniques common
to probit analysis, and then estimating, with appropriate confidence
intervals, the number of documents necessary to output in order to
retrieve a certain specified quantity of the relevant documents in
the corpus.

The estimation process entails taking an initial sample for which
the recall ratio (see Fig. 3.1) is determined. Then a second sample
(of the same size) is taken, and based on the overlap of common
"relevant" documents an estimate of the total set of relevant
documents in the corpus is made. This technique involves the use of
the hypergeometric distribution, and requires that the samples be
random.* The result of the sample sequence is used to construct a
search characteristic curve which measures or reflects the number of
documents needed to be retrieved in order to get a certain number of
relevant documents.

Wiederkehr (148) also utilizes the search characteristic curve
to estimate quantity output, and presents the interesting notion that
any search strategy has an equivalent series of single stage random
searches to generate the desired number of relevant documents in the
corpus. The notion of defining a search inquiry as a multiple of single
stage random searches is very useful, and will be incorporated in the
proposed methodology discussed in the next section of this chapter.

The usefulness of a search characteristic curve is limited by the
requirements for data sampling, and by the consistency of judgments
about what is or is not relevant to an arbitrary inquiry. Also, the
distribution characteristics essential to the probit/hypergeometric
approach are not sufficiently satisfied by a DRS.

* For a more complete discussion of this procedure see Feller (50).

In summary, the principal efforts to date have not developed an
operational procedure for estimating Rq that could be useful to a
manager or designer of a DRS. The common assumption of random
assignment of descriptors to documents, or its equivalent term-term
independency assumption, is not satisfied by actual DRSs. In addition,
those procedures that could, albeit indirectly, lead to estimates of
Rq require an impractical amount of data and extensive relevance
judgments.

4.3 PROPOSED METHODOLOGY FOR DEVELOPING THE Rq MEASURE

As noted, previous attempts to construct a retrieval quantity
measure have, in general, failed to correctly represent the
characteristics of DRS components, and also have not taken advantage
of the statistical regularity common to certain components of
coordinate indexed DRSs.

At the outset of developing an operational tool, it is advantageous
to indicate the desirable characteristics that the measure should possess.
Four such characteristics are:

1. The Rq measure should be defined in terms of the basic DRS
components (or their equivalent distributions).

2. The measure should use data that is convenient to obtain in 
operational DRS settings, and easy to construct for those 
systems in the design stage. 

3. The value of the measure should be easy to compute. 

4. The measure should possess stability to allow (in the dimen- 
sion it measures) — (a) monitoring of intra-system changes, 
and (b) inter-system comparisons. 

As a preamble to the specifications of the measure, a brief
review of the basic DRS components, relationships and characteristics
will be given. Where a DRS characteristic or relationship is
recommended for incorporation in the measure, a hypothesis will be
made about the particular system property. In Chapter 5, the various
hypotheses stated in this chapter will be analyzed for acceptance or
rejection.

4.3.1 Fundamental DRS Relationships 

As noted in an earlier chapter, the basic DRS components are: 

(a) the system corpus -- D

(b) the system thesaurus -- T

(c) the term-document distribution -- DXT

The DXT distribution is the basis from which all other DRS charac- 
teristics are derived. For the class of DRSs of interest to this analy- 
sis the DXT matrix is binary, and a hypothetic example is given in 
Fig. 4.3. 

If one arrays the columns in the DXT matrix such that the term
with the greatest frequency of use is given rank 1, the second most
frequently used term rank 2, and so on, the resulting DXT matrix
can be represented by the term frequency of use distribution in
Fig. 4.2. Note that the most frequently used descriptor is assigned
N_max, the highest frequency of use, and the least used (>0)
descriptor N_min. When N_min = 1, the frequency-rank distribution is
effectively normalized. However, if N_min > 1, as is the case in
certain truncated distributions, the distribution can be normalized
by dividing by N_min, as indicated in Fig. 4.2.

The term frequency of use distribution is a commonly available
DRS statistic, and a preferred data source for the estimation process.
In order to formally describe the term frequency of use distribution
two hypotheses will be offered:

I. The term-frequency-of-use versus rank distribution is a
decreasing concave (convex) function in ordinal (log-log) space,
and is closely approximated by the Mandelbrot-Estoup-Zipf (MEZ)
distribution.

The Mandelbrot-Estoup-Zipf* (94, 95, 96, 153) relationship is defined
for the distribution of word frequency in an unrestricted language,
in which the relative probability of occurrence of a word or term is
defined to be

    $$ P_i = K (r_i + B)^{-\alpha} $$

where  r_i = the rank of term i that is used N_i times
       P_i = probability of occurrence of the term i (with rank r_i)
       K   = a constant of exponential form, derived from the
             exponential law for optimum codes; for this application
             it is a constant to be determined empirically
       \alpha = a ratio of constants from the same derivation; also
             a constant to be determined empirically.

The basic MEZ canonical form is illustrated in log-log space in
Fig. 4.4. For comparison, the more specific Zipf's Law (a special
case of the MEZ form) is also indicated.

The term-frequency of use distribution is a representation of the
column marginals of the DXT distribution. Taking the row marginals of
the DXT matrix yields the depth of indexing distribution. A typical
depth of indexing distribution is illustrated in Fig. 4.5. This
distribution displays the assignment of terms to documents, and will
be referred to again in a later chapter as a source of data to define
the degree of homogeneity of a DRS corpus.

Fig. 4.4 -- Term frequency of occurrence versus rank in log-log space

Fig. 4.5 -- Typical depth of indexing distribution

An additional DRS characteristic, and one of central relevance
to the Rq measure, is the term-term (TXT) correlation matrix. This
matrix is defined as follows:

    $$ (DXT)^{\top} \cdot DXT = TXT $$

The element TXT(i,j) represents the number of co-occurrences of
terms i and j. For example, if terms i and j were assigned to the same
n documents, the value of TXT(i,j) would be n. Another way of defining
the elements of TXT is that they are the inner product of the i-th
column vector of DXT with the j-th column vector of DXT. Also, the
matrix TXT is symmetric.
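
The relationships among DXT, its marginals and TXT can be sketched on a
small example. The 4-document, 3-term matrix below is hypothetical and
is not taken from the test system; it only illustrates that the column
marginals give the term frequencies of use, the row marginals give the
depth of indexing, and the inner products of columns give TXT.

```python
# Hypothetical binary DXT matrix: rows are documents, columns are terms.
DXT = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
]
n_docs, n_terms = len(DXT), len(DXT[0])

# Column marginals: frequency of use of each term.
term_frequency = [sum(row[i] for row in DXT) for i in range(n_terms)]

# Row marginals: depth of indexing of each document.
depth_of_indexing = [sum(row) for row in DXT]

# TXT(i,j) is the inner product of column i and column j of DXT.
TXT = [[sum(row[i] * row[j] for row in DXT) for j in range(n_terms)]
       for i in range(n_terms)]

print("term frequency:   ", term_frequency)
print("depth of indexing:", depth_of_indexing)
for row in TXT:
    print(row)
```

In this sketch the diagonal of TXT equals the term frequencies of use,
since every term co-occurs with itself in each document it indexes.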

Having defined the TXT distribution, the second hypothesis about
the term frequency of use distribution can be made:

II. The term-term co-occurrence distribution is not generated
by a process which selects terms for assignment independent
of one another.

Since the Rq measure emphasizes quantity, a third hypothesis is
of interest, and is also based on the data in the TXT distribution:

III. Terms with the same frequency of use have essentially the
same statistical characteristics in the TXT distribution.






4.3.2 Inquiry Definition and Generation 

The process of inquiry generation is initiated by the system user, 
who upon "experiencing" a need for information, converts that need 
into a "natural language" request, and then interprets (with or without 
the aid of the DRS personnel) the request into a formal DRS inquiry. 
A formal inquiry is defined as consisting of terms from the system 
thesaurus that are coordinated in accordance with the system grammar. 
The rules of coordination to be used in this analysis are defined by
Boolean Algebra. The explicit operations used for term coordination
are: union or logical sum (+), intersection or logical product (·),
and exclusion or logical negation (-).

The pertinent characteristics of the inquiry are the form (the
number of terms and operators by type) and the frequency of use of the
terms. The semantic characteristics of the inquiry are not used in
the Rq determination, as it is assumed that terms with the same
frequency of occurrence have essentially the same term-term
co-occurrence characteristics (Hypothesis III above). This assumption,
which is tested in the next chapter, simplifies the inquiry generation
process for developing hypothetic DRS workloads for DRSs in the design
stage.

4.3.3 Inquiry-Retrieval Quantity Measure Relationship

The basic variables relating inquiry terms to documents retrieved 

are: 

(1) term-frequency of use (f(i)) 

(2) term-term co-occurrence values (TXT(i,j)) 






For the logical operators "+", "·" and "-", the following
relationships hold for elementary two-term inquiries:

    Request              Inquiry    Output Quantity
    -----------------    -------    ----------------------
    T_i                  i          f(i)
    T_i and T_j          i·j        TXT(i,j)
    T_i or T_j           i+j        f(i) + f(j) - TXT(i,j)
    T_i and not T_j      i-j        f(i) - TXT(i,j)


Therefore, for all elementary two-term inquiries, knowledge of the
term frequencies of use and their co-occurrence value is sufficient
to determine the output quantity. For more complex inquiries in which
many terms are coordinated, the determination of Rq is not so simple.
It follows, however, that if the single terms in the above example were
replaced by groups of, say, conjunctively related terms, the same
relationships would hold. For example, given groups E and F with a
logical product O_EF, the following is true:

    Inquiry    Output Quantity
    -------    ------------------
    E·F        O_EF
    E+F        O_E + O_F - O_EF
    E-F        O_E - O_EF


The above relationships hold for sequences of disjunctively related
groups of conjunctively coordinated terms, or for conjunctively
related sequences of disjunctively coordinated terms. It can be shown
that any retrieval specification (in the propositional or predicate
calculus) on the set of thesaurus terms can be represented in
disjunctive or conjunctive normal form. A disjunctive normal form is a
disjunction of clauses with no repetition of terms within the clauses.
A clause is simply a finite conjunction of terms (where negation is
defined as a negative conjunction). Also, every disjunctive normal
form has a dual conjunctive normal form (53, 103, 108, 109). Thus, no
matter how complex, an inquiry can be converted to a string of clauses
that can be evaluated for quantity output, as per the relationships
in the above example. The crucial value to determine is the logical
product.
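
As a minimal sketch of the two-term relationships in this section, the
code below computes the output quantity from the term frequencies of
use and the TXT values alone, and checks it against a direct count over
the documents. The matrix, the function names and the operator labels
are hypothetical and serve only as an illustration.

```python
# Hypothetical binary DXT matrix: rows are documents, columns are terms.
DXT = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
]
n_terms = len(DXT[0])

# freq[i]: frequency of use of term i (column marginal of DXT).
freq = [sum(row[i] for row in DXT) for i in range(n_terms)]
# TXT[i][j]: number of documents indexed by both term i and term j.
TXT = [[sum(row[i] * row[j] for row in DXT) for j in range(n_terms)]
       for i in range(n_terms)]


def output_quantity(op, i, j):
    """Output quantity of a two-term inquiry, using only freq and TXT."""
    if op == "and":
        return TXT[i][j]
    if op == "or":
        return freq[i] + freq[j] - TXT[i][j]
    if op == "and not":
        return freq[i] - TXT[i][j]
    raise ValueError("unknown operator: " + op)


def direct_count(op, i, j):
    """The same quantity, counted document by document from DXT."""
    tests = {
        "and": lambda row: row[i] and row[j],
        "or": lambda row: row[i] or row[j],
        "and not": lambda row: row[i] and not row[j],
    }
    return sum(1 for row in DXT if tests[op](row))


for op in ("and", "or", "and not"):
    print(op, output_quantity(op, 0, 1), direct_count(op, 0, 1))
```

The two columns printed for each operator agree, which is the sense in
which the frequencies of use and the co-occurrence value are sufficient
for elementary two-term inquiries.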

4.4 HYPOTHESES FOR RETRIEVAL QUANTITY ESTIMATIONS 

Given the search strategy of direct match, two methods of estimating
the logical product of inquiry terms, and the value of Rq, are
discussed in this section.

The problem of determining Rq for a multiterm inquiry is illustrated
by the following example. For an n term disjunctively coordinated
inquiry, T_1 + T_2 + ... + T_n, the estimate of Rq is:

    $$ R_q = f(1) + f(2) + \cdots + f(n) - \text{Logical Product}_{(1,2,\ldots,n)} $$

The simplest model for estimating the logical product of two or
more terms is one that assumes that the descriptor assignment to a
document is a random assignment. This case has been noted as being
basically incorrect; however, it can be employed as a stepping stone
to an eventual solution. For this model, the logical product of two or
more terms is:

    $$ \text{Logical Product}_{(1,2,\ldots,n)} = \frac{f(1) \cdot f(2) \cdots f(n)}{D^{\,n-1}} $$

where D is the number of documents in the corpus, and R'q for an
n term disjunctive inquiry, T_1 + T_2 + ... + T_n, is:

    $$ R'_q = f(1) + f(2) + \cdots + f(n) - \frac{f(1) \cdot f(2) \cdots f(n)}{D^{\,n-1}} $$

It can be shown that the actual value of the logical product and
the "random-case" values do diverge significantly.* However, if one
makes the hypothesis:

IV. There exists a stable statistical relationship between the
actual term-term distribution and the hypothetical "random
case" distribution,

then the above formulation yielding R'q can be modified to yield
an accurate estimator of Rq. From the above hypothesis the proposed
modification is:

    $$ R_q = \gamma R'_q $$

or what is equivalent

    $$ \text{Logical Product}_{(1,2,\ldots,n)} = \gamma_{1,2,\ldots,n} \left( \frac{f(1) \cdot f(2) \cdots f(n)}{D^{\,n-1}} \right) $$

This hypothesis will be tested for acceptance or rejection in
Chapter 5. If the hypothesis is accepted then a very convenient method
for estimating Rq will be available.

* A statistical test of an actual DRS is performed in Chapter 5 to
demonstrate that the distribution of logical products of terms is not
equivalent to a "random-distribution."

Given that the proportion γ proves to be acceptable, the proposed
utilization of the proportion for multi-term inquiries is illustrated
in the following example:

Inquiry: 'Ig'T^'T^ 

Estimation of R^: 



A second model for estimating the logical product of two or more
terms can be constructed by using the row (MR) marginals and column
(MC) marginals, and the total sum (TS) of marginals for the TXT matrix.
For this model, the expected value of the logical product of two or more
terms is:

    $$ \text{Logical Product}_{(1,2,\ldots,n)} = \sum \frac{(MR_i)(MC_j)}{TS} $$

where  MR_i = sum of the term co-occurrences in row i, for term i
              with terms 1, ..., T
       MC_j = sum of the term co-occurrences in column j, for term j
              with terms 1, ..., T
       TS   = \sum_{k=1}^{T} MR_k = \sum_{k=1}^{T} MC_k

and R'q for an n term disjunctively coordinated inquiry,
T_i + T_j + ... + T_n, is:

    $$ R'_q = f(i) + f(j) + \cdots + f(n) - \sum \frac{(MR_i)(MC_j)}{TS} $$

where an analogous hypothesis, to model 1, is

    $$ R_q = \lambda R'_q $$

or what is equivalent

    $$ \text{Logical Product}_{(1,2,\ldots,n)} = \lambda_{1,2,\ldots,n} \sum \frac{MR_i \cdot MC_j}{TS} $$

Using experimental data in Chapter 5, the above relationships will
be tested to determine if they can be accepted or rejected for use as an
operational tool.
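
The two estimation models can be compared on a small case. The sketch
below is restricted to a two-term logical product, where the
"random-case" estimate is f(i)f(j)/D and the marginal estimate is
(MR_i)(MC_j)/TS. The matrix is hypothetical, the function names are
illustrative, and it is an assumption of this sketch that the TXT
marginals include the diagonal entries. The ratios printed at the end
play the role of the stable proportion proposed in Hypothesis IV and
its analogue for the second model.

```python
# Hypothetical binary DXT matrix: rows are documents, columns are terms.
DXT = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
]
n_docs, n_terms = len(DXT), len(DXT[0])

freq = [sum(row[i] for row in DXT) for i in range(n_terms)]
TXT = [[sum(row[i] * row[j] for row in DXT) for j in range(n_terms)]
       for i in range(n_terms)]


def random_model(i, j):
    """Two-term logical product if descriptors were assigned at random."""
    return freq[i] * freq[j] / n_docs


def marginal_model(i, j):
    """Two-term logical product from the TXT marginals (diagonal included,
    which is an assumption of this sketch)."""
    row_marginal = sum(TXT[i])                   # MR_i
    col_marginal = sum(row[j] for row in TXT)    # MC_j
    total = sum(sum(row) for row in TXT)         # TS
    return row_marginal * col_marginal / total


i, j = 0, 1
actual = TXT[i][j]
ratio_random = actual / random_model(i, j)
ratio_marginal = actual / marginal_model(i, j)

print("actual logical product:", actual)
print("random-case estimate:  ", random_model(i, j), "ratio", ratio_random)
print("marginal estimate:     ", marginal_model(i, j), "ratio", ratio_marginal)
```

A ratio computed from one pair of terms in a toy matrix says nothing
about stability; the hypothesis concerns whether such ratios are stable
across many term pairs in an actual DRS.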

All the business of life is to endeavor to find out
what you don't know by what you do.

                                The Duke of Wellington

Chapter 5

THE RETRIEVAL QUANTITY MEASURE: EXPERIMENTS AND RESULTS

5.1 INTRODUCTION

The purpose of this chapter is to analyze the various hypotheses
made thus far in this report about the fundamental characteristics
and relationships of coordinate-index DRSs, and to construct and test
an operational Rq estimation model for systems that are established
or in the design stage.

In the preceding chapter the following hypotheses about DRSs were 
stated: 

(1) the term-frequency-of-use versus term rank distribution is
a monotonically decreasing concave function in log-log space,
and is closely approximated by the M-E-Z canonical form.

(2) the term-term co-occurrence distribution is not generated by
a process which selects terms for assignment independent of
one another; that is to say, the term co-occurrence
distribution is not the result of random sampling from the
thesaurus.

(3) the co-occurrence value of two terms is directly proportional
to a function of the frequencies of use for the terms, and
can be predicted as a function of that factor.

(4) terms with the same frequency of use have essentially the
same statistical characteristics. That is, two terms i and
j with frequencies of use f(i) = f(j) will have approximately
the same number of co-occurrences with other terms in the
thesaurus.

(5) the Retrieval Quantity (Rq) of a coordinate index DRS can
be predicted for formal inquiries.

One of the principal aims of this chapter is the analysis of these
hypotheses for acceptance or rejection. The required experiments
and analyses for this task and for the construction of the Rq model
are discussed next.

5.2 EXPERIMENTS: SETTING AND DESCRIPTION

Experiments for the analysis of the above hypotheses were performed
at the Institute of Library Research Information Processing
Laboratory at the University of California, Berkeley, California.
At the time of the experiments, the Laboratory facilities consisted
of three Sanders CRT remote on-line terminals to an IBM 360, Model 40,
128K system. The CRTs had keyboard input and visual display output,
and were capable of simultaneous operation.

The Laboratory system was equipped with three search grammars,
and eight word association files (including direct match search
capability).

The experiments were set to take place over a period of time in
which the Laboratory DRS corpus and thesaurus were expanded. The
original plan called for a three-stage growth sequence, but only the
first and second stages were realized. The system characteristics for
the two stages are tabulated in Table 5.1, and the term-frequency of
use versus term rank distribution for the system, at the end of stage
two, is shown in Fig. 5.1. Samples of the system thesaurus and
term-document assignments are included in Appendix B. The DRS corpus is
composed exclusively of documents on information science, and can be
appropriately classified as being homogeneous. For a more complete
description of the Laboratory and its research projects see Maron, et
al. (98).

Table 5.1
ILR DOCUMENT RETRIEVAL SYSTEM

    Characteristics              Stage 1          Stage 2
    -------------------------    -------------    -------------
    Corpus (documents)           300              400
    Thesaurus (terms)            368              393
                                 (348 active)     (375 active)
    Average depth of indexing    14               12-13
    Average term usage           3-4              3-4

Table 5.2
DATA BASE SAMPLE

    Characteristics
    -------------------------
    Corpus                       102
    Thesaurus                    320 (307 active)
    Average depth of indexing    14
    Average term usage           3-4

5.2.1 Experiments and Analysis 

The data collection and analysis involved several steps. The
first consisted of gathering the DRS responses, over the two stages
of system growth, for a set of formal inquiries. The second step
entailed an analysis of a data sample from the DRS term-document
distribution,* and the third, the evaluation of the retrieval quantity
model. In the next two sections, 5.3 and 5.4, all these steps are
discussed in detail, and the hypotheses are analyzed for acceptance or
rejection.

5.3 DOCUMENT RETRIEVAL SYSTEMS - COMMON CHARACTERISTICS

In this section, the issues of statistical regularity among co- 
ordinate indexed DRSs, and the data analysis which demonstrates the 
statistical similarity of the test system to other DRSs, of different 
size and subject matter, are discussed. 

A number of researchers, Brookes (20), Fairthorne (49), Mandelbrot
(94, 95, 96), to mention a few, have observed that there are certain
statistical regularities common to a variety of documentation systems
and activities. Fairthorne (49), in fact, presents a brief survey of
this topic. All of these findings revolve around the concept that
the underlying behavior of DRSs is "hyperbolic" in nature (49).

* See Appendix C for a description of the data sample.

Of interest to this analysis are the characteristics of
derived-manipulative indexed DRSs that exhibit similar properties,
independent of system size and subject matter. The basic relationship
for DRSs is the index term-document distribution, from which all the
term-term and document-document functional relationships can be
derived. Therefore, if the term-document distributions of different
DRSs can be shown to be statistically similar, or definable by an
analytic/canonical form, the argument for statistical regularity among
DRSs can be accepted. The principal vehicle for showing this is the
term-frequency-of-use distribution.

5.3.1 The Term-Frequency-of-Use Distribution 

The preferred characteristic to use to determine if there is a
statistical similarity among DRSs is the term-document (TXD)
distribution. However, the TXD distribution is awkward to deal with
and is rarely ever published.* Thus the strategy taken is to use
surrogate distributions; namely, the term-frequency-of-use versus term
rank, the term usage versus the cumulative-frequency distribution, and
the depth of indexing distribution. The first two distributions, in
particular, are readily available from published research and all
three distributions are convenient to illustrate. The relationships
between these distributions and the TXD matrix are illustrated in
Fig. 5.2.

* A richer but unfortunately abstruse discussion is given by
Mandelbrot (94-96).



Fig. 5.2 -- Illustration of relationships between the term-document
matrix and the term frequency of use vs. rank distribution, the
term usage vs. cumulative usage distribution, and the depth of
indexing distribution

Figure 5.3 shows in log-log space the term frequency distribution
for the test system sample, the test system, and the three larger DRSs
investigated by Litofsky (90). All the curves are concave
monotonically decreasing relationships. The two DRSs investigated by
A. D. Little (1) are shown in Fig. 5.4, and these systems also display
the same concave monotonically decreasing term frequency of use versus
rank in log-log space. It is important to note that these systems
differ greatly in size, and have different subjects for corpus content.

In addition, Houston and Wall (68) and Wall (143) have analyzed
some 14 DRSs and plotted their term-frequency of use versus the
cumulative percent of thesaurus utilization.* Their plots are
reproduced in Figs. 5.5 and 5.6. All the systems plotted exhibit a
remarkable linearity for the postings per term versus the cumulative
distribution, which led Houston and Wall to conclude that the number
of terms T in a system vocabulary varies directly with the log of TU,
the total number of term uses, and has the form:

    $$ T = a \log_{10}(TU + b) - c $$

where  a = 3300
       b = 10000
       c = 12600

for values of TU between 10,000 and 1,000,000. As further evidence
of statistical regularity, the three systems analyzed by Litofsky (90)
and the ILR systems are plotted in the Houston-Wall dimensions. These
plots are also linear and are shown in Figs. 5.7 and 5.8. The above
relationship holds quite nicely for the keyword files analyzed by
Litofsky. The ILR system, however, is too small, as its TU is < 10,000,
and the above constants require adjustment; the form of the
relationship, however, is satisfied.

* Fairthorne (49) points out that the two methods of illustration
are just different ways of showing the same characteristics.

Fig. 5.4 -- Term frequency of use versus rank for a 10 percent
sample of the Industrial Collection System investigated
by A. D. Little (1, 2)

Fig. 5.5 -- Term usage versus cumulative utilization of thesaurus
for systems investigated by Houston and Wall (68).
(See Table 5.3 for systems corresponding to numbered curves)

Fig. 5.6 -- Term usage versus cumulative utilization of
thesaurus for systems investigated by Wall (143)
(see Table 5.3 for systems corresponding to numbered curves).
Axis label: P(x), cumulative distribution: fraction of terms
used x or fewer times
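
The Houston and Wall relationship quoted in this section is simple to
evaluate. The sketch below uses the constants a = 3300, b = 10000 and
c = 12600 given in the text; the function name is hypothetical, and the
range check reflects the stated validity limits on TU rather than
anything in the original formula.

```python
import math


def vocabulary_size(total_term_uses, a=3300.0, b=10000.0, c=12600.0):
    """Number of vocabulary terms T predicted from total term uses TU.

    Implements T = a * log10(TU + b) - c with the constants quoted in
    the text, which are stated for TU between 10,000 and 1,000,000.
    """
    if not 10_000 <= total_term_uses <= 1_000_000:
        raise ValueError("constants are quoted only for 10,000 <= TU <= 1,000,000")
    return a * math.log10(total_term_uses + b) - c


for tu in (10_000, 100_000, 1_000_000):
    print(tu, round(vocabulary_size(tu)))
```

With these constants the predicted vocabulary grows from roughly 1,600
terms at the lower limit to roughly 7,200 terms at the upper limit. A
system such as the ILR test system, whose TU is below 10,000, falls
outside the range and would need adjusted constants, as the text notes.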

This empirical evidence is even more impressive when one compares 
the range in corpus and thesaurus size, the different subjects covered, 
and the variation in index term utilization. These pertinent system 
characteristics are tabulated in Table 5.3. 

5.3.2 The Term-Frequency-of-Use Canonical Form

In addition to the graphical interpretation, which implies strong 
statistical stability, a number of efforts have been made to define 
the term-frequency-of-use versus rank relationship analytically. 

The most well-known attempt to define in equation form a general
relationship between term frequency of occurrence and term rank is by
Zipf (152), who suggested the form:

    $$ f(r) \cdot r = K $$

where  K    = a constant for a particular (large) sample of text in
              any language
       f(r) = the frequency of occurrence of the term with rank r
       r    = term rank; a positive integer.

This expression is based on empirical observation of free or running
text, and as noted by Mandelbrot (95) and Fairthorne (49), it is
an extension of the earlier work of J. B. Estoup in 1916 and J. Willis
in 1922. Mandelbrot (94, 95), using communication or information
theory as a basis, has derived a relationship between word frequency
of use and the rank of a word that is more general than Zipf's, and of
which Zipf's is a special case. Because of the various contributors,
this relationship will be referred to as the Mandelbrot-Estoup-Zipf
(MEZ) distribution, and has the form:

    $$ f(r) = K (r + B)^{-\alpha} $$

For B = 0 and α = 1, the above relationship reduces to Zipf's "Law."
However, Zipf's equation calls for a linear plot of slope minus one in
log-log space, which is not satisfied (even with congruent intercepts)
by the curves plotted in Figs. 5.3 and 5.4.

Fig. 5.7 -- Log-probability plot of keyword distribution (90).
Axis label: cumulative distribution, percent of keywords
having f or fewer occurrences

Fig. 5.8 -- Log probability of descriptor usage, ILR test system

For the purposes of this analysis it will be sufficient to show that
the MEZ canonical form is close to the actual term-frequency of use
versus term rank distribution. To illustrate how the parameters K, B
and α are defined for a DRS (at a certain point in time), the test
system characteristics will be used. For the test DRS:

    D = 102
    T = 370
    T' = 307 (the number of active terms in the thesaurus)
    \bar{D} = 14 (the average depth of indexing)
    f(r=1) = 32 (the frequency of use of the term with rank = 1)
    f(r≈300) = 1 (the frequency of use of a term with rank ≈ 300)

Zipf [see Booth (10)] has noted that a term will occur once if

    $$ 1.5 > \tau P(r) \geq 0.5 $$

where  P(r) = the probability of occurrence of a term with rank r
            = f(r) / \sum_{r=1}^{T'} f(r)
       \tau = the total number of term occurrences
            = \sum_{r=1}^{T'} f(r)

The above relationship can be generalized for a term occurring n times:

    $$ (n + 1/2) > \tau P(n) \geq (n - 1/2). $$

Substituting the MEZ form for P(n) yields

    $$ (n + 1/2) > \tau K' (r + B)^{-\alpha} \geq (n - 1/2). $$

For a term with the highest rank, r ≈ T', and where B << T' (which
is always the case; see Mandelbrot (95)), and n = f(T') = 1, the
inequality becomes:

    $$ 1.5 > \tau K' (r)^{-\alpha} \geq 0.5 $$

Because the condition of interest is r_max, only the right side of the
inequality need be used. Therefore,

    $$ \tau K' (r_{max})^{-\alpha} = 0.5 $$

solving for K' yields

    $$ K' = \frac{0.5 \, (T')^{\alpha}}{\tau} $$

Thus, given the number of different or active thesaurus terms, T',
and the total number of term occurrences, τ, one can estimate K' by
assuming an α, or estimate α assuming a K'. According to Booth (10),
Zipf (153), and Mandelbrot (93-95), α ≈ 1. Since more is known about
the range of α than K and all that is needed is a "quick"
approximation, α ≈ 1 will be used. With α = 1, and taking τ as roughly
the corpus size times the average depth of indexing (102 × 14 = 1428),

    $$ K' = \frac{(0.5)(307)}{\tau} \approx 0.1 $$

Note, if f(r) instead of P(r) were being estimated, then

    $$ K \approx 150. $$

With α and K estimated, the next step is to determine B.

The simplest way to estimate B is at the intercept f(r=1),
where B is obviously not negligible because r = 1. Solving

    $$ f(r) = K (r + B)^{-\alpha} $$

for B yields

    $$ B = \left( \frac{K}{f(r)} \right)^{1/\alpha} - r $$

For K = 150, f(r) = 30 and r = 1, the estimate for B is 4 to 4.5,
depending on whether α = 1 or 0.9, respectively.
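
The parameter estimates in this derivation can be retraced numerically.
The sketch below solves the right-hand side of the inequality for K'
and the MEZ form for B; the function names are hypothetical, and the
total number of term occurrences is taken as corpus size times average
depth of indexing (102 × 14), which is an assumption of this sketch
rather than a figure stated in the text.

```python
def estimate_k_prime(active_terms, total_occurrences, alpha=1.0):
    """Solve total * K' * active_terms**(-alpha) = 0.5 for K'."""
    return 0.5 * active_terms ** alpha / total_occurrences


def estimate_b(k, frequency, rank=1, alpha=1.0):
    """Solve frequency = k * (rank + B)**(-alpha) for B."""
    return (k / frequency) ** (1.0 / alpha) - rank


active_terms = 307               # T' for the test sample
total_occurrences = 102 * 14     # assumed: documents x average depth

k_prime = estimate_k_prime(active_terms, total_occurrences)
k = k_prime * total_occurrences  # constant when f(r) is estimated instead of P(r)
b = estimate_b(150, 30)          # values used in the text, with alpha = 1

print("K' =", round(k_prime, 3))
print("K  =", round(k, 1))
print("B  =", b)
```

Under these assumptions K' comes out near 0.1 and K near 150, in line
with the rounded values in the text, and B = 4 for α = 1.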

The comparison of MEZ values and the actual term frequency-rank
distribution, for the test sample, is shown in Fig. 5.9 and tabulated
in Table 5.4.

Fig. 5.9 -- Comparison of ILR test sample term frequency of use
versus rank distribution with the canonical form






Table 5.4

COMPARISON OF MEZ VALUES WITH ACTUAL TERM USAGE
VERSUS RANK VALUES FOR THE TEST SAMPLE

    Rank    ILR Test Sample    MEZ Value*
    ----    ---------------    ----------
      1           32              35.2
      2           27              29.9
      3           26              26.0
      4           24              23.1
      5           22              20.8
      6           21              18.9
      7           20              17.3
      8           17              16.0
      9           16              14.9
     10           15              14.0
     11           14              13.1
     12           13              12.4
     13           12              11.7
     14           11              11.1
     15           10              10.6

    * K = 150; B = 4.5; α = 0.9.
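
The MEZ column of Table 5.4 can be recomputed from the canonical form
f(r) = K(r + B)^(-α). In the sketch below the function name and the
default B = 4 are assumptions: the table footnote states B = 4.5, but
the tabulated values appear to be matched more closely by B = 4 with
K = 150 and α = 0.9, so that value is used here and should be treated
with caution.

```python
def mez_frequency(rank, k=150.0, b=4.0, alpha=0.9):
    """Term frequency of use predicted by the MEZ canonical form."""
    return k * (rank + b) ** (-alpha)


# Observed frequencies of use for ranks 1 to 15 (ILR test sample, Table 5.4).
observed = [32, 27, 26, 24, 22, 21, 20, 17, 16, 15, 14, 13, 12, 11, 10]

print("rank  observed  MEZ")
for rank, actual in enumerate(observed, start=1):
    print(f"{rank:4d}  {actual:8d}  {mez_frequency(rank):5.1f}")
```

Passing b=4.5 to the same function shows how sensitive the first few
ranks are to the choice of B, which is where B is "not negligible" in
the sense used in the derivation.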



On the basis of this empirical evidence, the hypothesis that the
term-usage versus rank relationships are closely approximated by the
MEZ canonical form is accepted.

5.3.3 Depth of Indexing Distribution 

The depth of indexing distribution is an additional DRS
characteristic that can be used to determine statistical similarities
between DRSs. The distribution is derived from the DXT distribution
(it is the distribution of the row marginals) as indicated in Fig. 5.2.

The indexing density distributions for the test system and the
two keyword systems employed by Litofsky (90) are shown in Figs. 5.10
and 5.11 respectively. As for the term usage versus rank distribution,
it would be very desirable to represent the depth of indexing
distribution by a canonical form. While this exercise is not carried
out here, a suggested canonical form is noted in Chapter 7.

5.3.4 The Term-Term Co-occurrence Distribution

The term-term (TXT) matrix is derived from the DXT matrix as shown 
in Fig. 5.12. For the test system, the TXT matrix is quite sparse 
(=82 percent). The non-zero integer entries indicate the number of 
instances in which the two terms, defining the intersection, are used 
as common or co-descriptors for documents in the corpus. 

Three hypotheses have been put forward regarding the character- 
istics of the TXT matrix. Each hypothesis will be stated and then 
analyzed. The first case is: 

5.3.4.1 Term Independency. The TXT matrix is not generated by 
a process which selects terms for assignment independent of one another. 

A prevalent assumption in previous analyses is that the descriptor
terms in the system thesaurus are assigned independent of one another
to documents in the corpus. The often stated qualification is that
while this assumption of independency is not exactly satisfied, it
is a reasonable approximation. It does not appear that this assumption
has ever been statistically tested. Perhaps a complicating factor
is that the convenient chi-square test for goodness of fit is not
appropriate in this case. This is so because the DXT matrix, as
defined for the DRSs of interest, is binary and very sparse (i.e., a
matrix condition in which the number of elements whose value is zero
equals or exceeds the number of elements whose value is non-zero).
Thus the theoretical limitation of the chi-square test, which requires
that the expected value of the sample of population elements to be
tested must be at least equal to 5, is not satisfied. Hence, the
chi-square test cannot be used to statistically ascertain whether the
DXT matrix is or is not generated as though the descriptors are
assigned independent of one another. This situation also holds for the
TXT matrix. Even though there are TXT(i,j) which exceed 5, there are
many elements that do not, because the TXT matrix is also sparse;*
this necessarily follows because

    $$ TXT = (DXT)^{\top} \cdot (DXT) $$

and DXT is sparse.

Fig. 5.11 -- Depth of indexing distribution for
the systems investigated by Litofsky (90)

In lieu of the chi-square test, the test elected to accept or reject
the hypothesis is called the "Generalized Likelihood Ratio Test" (see
Mood and Graybill (102)). The Generalized Likelihood Ratio (GLR) is
defined as the quotient

$$e = \frac{L(s)}{L(o)}$$

where L(s) = the maximum of the likelihood function in the sample
             region or space s, with respect to the parameters
      L(o) = the maximum of the likelihood function in the population
             region or space o, with respect to the parameters

and -2 Log e is defined as a chi-square variate.

*However, it is easily shown that TXT is never more sparse than DXT.

The null hypothesis of interest, H_0, is that the descriptor terms are
assigned independent of one another for each document-descriptor set.
When H_0 is true, -2 Log e is approximately distributed as chi-square
with N degrees of freedom when M is large. Thus the null hypothesis
can be tested by computing -2 Log e and comparing it with the desired
level of significance of chi-square. If -2 Log e exceeds the chi-square
level, H_0 will be rejected; otherwise H_0 will be accepted.

Given the DXT matrix, as illustrated in Fig. 5.13, the desire is
to show that the assignment of any one of the terms in the matrix is
independent of the occurrence of any other term; that is to say, the
probability of occurrence of term i is independent of term j. The
null hypothesis is:

$$H_0:\; L = \prod_{i=1}^{N} p_i^{\,n_i}\, q_i^{\,M-n_i}$$

where p_i = probability of term i occurring n_i times
      q_i = 1 - p_i
      M = the number of documents to be indexed
      N = the number of terms in the thesaurus

To test H_0, the GLR e is computed, where

$$e = \frac{L(s)}{L(o)}$$

and

$$L(s) = \sup_{P} \prod_{i=1}^{N} p_i^{\,n_i}\,(1-p_i)^{M-n_i}$$

where, in the context of this problem, the n_i/M are normalized
frequencies and are taken to be sufficiently representative of the
probabilities of occurrence of the descriptor terms. Also,

$$\hat{P} = \text{vector of } (p_1, \ldots, p_N)$$

which maximizes the function L(s); in this case, the empirically
observed frequencies of occurrence are the "best estimates" of the
elements of P.

Introducing logs for ease of computation yields

$$\operatorname{Log} L(s) = \sup_{P} \sum_{i=1}^{N} \left[ n_i \operatorname{Log} p_i + (M-n_i)\operatorname{Log}(1-p_i) \right]$$

where the normalized frequencies, f_i = n_i/M, can be substituted,
giving

$$\operatorname{Log} L(s) = \sum_{i=1}^{N} \left[ n_i \operatorname{Log} f_i + (M-n_i)\operatorname{Log}(1-f_i) \right]$$

Now it is necessary to compute L(o):

$$L(o) = \sup_{P} \prod_{(i_1,\ldots,i_N)} \left( P_{i_1 \ldots i_N} \right)^{n(i_1,\ldots,i_N)}$$

*The implicit assumption is that a term can be assigned only once
to a document. Therefore, the maximum frequency of use of any term is
the number of documents in the corpus, M.

where P_{i_1...i_N} is the probability that a randomly chosen document
has descriptor vector n(i_1,...,i_N) and is defined by

$$\hat{P}_{i_1 \ldots i_N} = \frac{n(i_1,\ldots,i_N)}{M}$$

Substituting, and introducing the Log for convenience, yields

$$\operatorname{Log} L(o) = \sum_{(i_1,\ldots,i_N)} n(i_1,\ldots,i_N)\,\operatorname{Log}\!\left(\frac{n(i_1,\ldots,i_N)}{M}\right)$$

Assuming that the identical occurrence of n(i_1,...,i_N) for more
than a few documents is not a very likely event,* Log L(o) can be
simplified as follows:

$$\operatorname{Log} L(o) = \sum_{j=1}^{K} j\, d(j)\,\operatorname{Log}\!\left(\frac{j}{M}\right); \quad \text{for } j < M$$

where K is the maximum number of congruent document vectors, and d(j)
is the number of descriptor vectors which correspond to exactly j
documents. In fact, in the usual case (of which the test system is an
example), K = 1, and the above relationship reduces to

$$\operatorname{Log} L(o) = M \operatorname{Log}\!\left(\frac{1}{M}\right)$$

Therefore, the expression to be evaluated is:

$$\operatorname{Log} e = \sum_{i=1}^{N} \left[ n_i \operatorname{Log} f_i + (M-n_i)\operatorname{Log}(1-f_i) \right] - M \operatorname{Log}\!\left(\frac{1}{M}\right)$$

*The most unlikely event is when the number of identical occurrences
of n(i_1,...,i_N) is M, which means the corpus consists of M
"identical" items, insofar as the thesaurus delineation of
concepts/subjects is concerned.

and -2 Log e is the chi-square variate of interest with N degrees of
freedom.* Since, for this analysis, N = 370, the normal approximation
to the chi-square distribution is used.

For the test sample, -2 Log e = 850, which is larger than the normal
approximation to the chi-square, which at the .005 level = 480.
Therefore, the hypothesis of term independency is rejected.
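
The statistic above can be sketched in a few lines of code. This is an illustrative sketch only, not the program used in the study: it assumes the DXT matrix is held as a list of binary rows (one row per document), takes 0 Log 0 as 0, and uses the general form of Log L(o) in which every distinct descriptor vector occurring c times contributes c Log(c/M). The function name `glr_statistic` is hypothetical.

```python
import math
from collections import Counter


def glr_statistic(dxt):
    """Return -2 Log e for a binary document-by-term matrix (rows = documents)."""
    M = len(dxt)         # number of documents
    N = len(dxt[0])      # number of terms

    # Log L(s): independence model with f_i = n_i / M and 1 - f_i = (M - n_i) / M.
    log_ls = 0.0
    for i in range(N):
        n_i = sum(row[i] for row in dxt)
        for count in (n_i, M - n_i):
            if count > 0:                      # 0 * Log 0 is taken as 0
                log_ls += count * math.log(count / M)

    # Log L(o): a descriptor vector shared by c documents contributes c * Log(c / M).
    vector_counts = Counter(tuple(row) for row in dxt)
    log_lo = sum(c * math.log(c / M) for c in vector_counts.values())

    return -2.0 * (log_ls - log_lo)
```

When every document has a distinct descriptor vector (K = 1), the second term reduces to M Log(1/M), as in the expression evaluated in the text. The result would then be compared with the chi-square critical value for N degrees of freedom.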

5.3.4.2 Term-Term Co-occurrence Factor. The next hypothesis to
test is whether the co-occurrence of two terms is directly proportional
to a function of the frequencies of use of the terms.

In Chapter 4, two candidate functions were proposed:

I.  $$\mathrm{TXT}(i,j) = \gamma\,\frac{f(i)\,f(j)}{D}$$

II. $$\mathrm{TXT}(i,j) = \gamma_1\,\frac{RS(i)\,CS(j)}{\sum RS(i)}$$

where f(i) = the frequency of use of term i
      TXT(i,j) = the value of the intersection of terms i and j
      RS(i) = the sum of the entries in row i
      CS(j) = the sum of the entries in column j
      D = the number of documents indexed.

The relationships of the above functions and variables and the TXT
matrix are shown in Fig. 5.12.

The variables of interest in the above equations are the γ's.
That is, in order for the estimations to be useful, the distribution
of values for γ must be stable and stationary. Therefore the forms of
the relationships that will be analyzed are:

$$\gamma = \frac{\mathrm{TXT}(i,j)}{f(i)\,f(j)/D} = \frac{\text{Actual}}{\text{Theoretical}}$$

$$\gamma_1 = \frac{\mathrm{TXT}(i,j)}{RS(i)\,CS(j)/\sum RS(i)} = \frac{\text{Actual}}{\text{Theoretical}}$$

*The variable is allowed to vary over the range 0 to 1, with
j = 1, ..., N.

A computer program was written to analyze a sample of the test
DRS TXT distribution. The program generated the TXT(i,j) for every
non-zero cell in TXT, computed the values of the candidate functions,
and the ratio of the actual to theoretical values for γ and γ_1. A small
sample is presented in Table 5.5. It is clearly evident that relationship
I (γ) is superior to relationship II (γ_1). Function II is very
unstable (it has a large variance) and it is not suitable as an estimator
of the value of TXT(i,j).
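
A minimal sketch of such an analysis is given below; it is not the original program. It assumes a binary DXT matrix stored as a list of rows (one per document), takes f(i) from the diagonal of TXT, and assumes that the row and column sums RS and CS include the diagonal, which the text does not specify. The function names are hypothetical.

```python
def txt_matrix(dxt):
    """TXT = (DXT)^T (DXT) for a binary document-by-term matrix."""
    n_terms = len(dxt[0])
    return [[sum(row[i] * row[j] for row in dxt) for j in range(n_terms)]
            for i in range(n_terms)]


def gamma_ratios(dxt):
    """Return (i, j, gamma, gamma_1) for every non-zero off-diagonal TXT cell, i < j."""
    D = len(dxt)                                   # number of documents indexed
    txt = txt_matrix(dxt)
    n = len(txt)
    f = [txt[i][i] for i in range(n)]              # f(i): diagonal of TXT
    rs = [sum(row) for row in txt]                 # RS(i), diagonal included (assumption)
    cs = [sum(txt[i][j] for i in range(n)) for j in range(n)]   # CS(j)
    total = sum(rs)
    results = []
    for i in range(n):
        for j in range(i + 1, n):
            if txt[i][j] > 0:
                gamma = txt[i][j] / (f[i] * f[j] / D)           # function I
                gamma_1 = txt[i][j] / (rs[i] * cs[j] / total)   # function II
                results.append((i, j, gamma, gamma_1))
    return results
```

Comparing the spread of the two ratio columns over a sample of cells is one way to see the difference in stability that the text reports for the two candidate functions.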

On the other hand, function I is very stable. The plot of theoretical
γ versus f(i) in log-log space is always linear, and all the
theoretical values of γ for any f(i) can be determined from a knowledge
of the relationship of f(i) and the γ's for f(i). An illustration of
this relationship is given in Fig. 5.14.

The empirical values of γ for terms with f(i) = 1 to f(i) = 32*
are plotted in Figs. 5.15 to 5.30. As shown, each occurrence or value
of γ either falls on the theoretical lower bound or lies above it on
a curve that is an integer multiple of the lower bound value. It is
always the case that the theoretical minimum value of γ is the lower
bound, and that whenever there is a difference between the lower bound
and the actual, the actual value is always an integer multiple of the
lower bound. For γ ≤ 5, the dispersion of γ values is small, and
increases for 5 ≤ f(i) ≤ 32.

*The highest term frequency of use in the data sample.

[Figure axis label: f(i), term frequency of use]
Fig. 5.15 -- γ factors for f(j) = 1, 2, 3, 4 and 1 ≤ f(i) ≤ 32
Fig. 5.19 -- γ factors for f(j) = 6 and 1 ≤ f(i) ≤ 32
Fig. 5.20 -- γ factors for f(j) = 7 and 1 ≤ f(i) ≤ 32
Fig. 5.23 -- γ factors for f(j) = 10 and 1 ≤ f(i) ≤ 32
Fig. 5.26 -- γ factors for f(j) = 13 and 1 ≤ f(i) ≤ 32

In an attempt to assess the distribution of the γ-factor values,
plots of the cumulative distribution of occurrence versus the ratio
of γ actual to γ theoretical minimum were prepared,* and are presented
in Figs. 5.31 to 5.37. For terms with a high frequency of use, it is
necessary to introduce a weighting factor, which as shown in the next
section is a stable and well behaved factor. At this point, sufficient
evidence has been accumulated (Table 5.5, and Figs. 5.15 to 5.37) to
satisfy the hypothesis that the term-term co-occurrences are definable
as a function of the term frequencies of use and are directly
proportionate to that factor.

5.4 THE RETRIEVAL QUANTITY MEASURE 

As described previously, the Retrieval Quantity (Rq) measure
indicates the quantity of documents (references) that are output by a
DRS in response to a formal inquiry. The purpose of this section is
to develop an operational form of such a measure, and to test the
measure with a set of actual inquiries on an operational system.

The procedure for predicting Rq for an inquiry entails several
steps:

(1) construction of the formal inquiry (from the user request)

(2) application of the term co-occurrence factor, γ

(3) determination of Rq

Step (1) has been discussed in Chapter 4, and steps (2) and (3) will
be analyzed in this section.

*Note this analysis is restricted to TXT(i,j) > 0.

Fig. 5.31 -- Cumulative frequency of the ratio of actual γ to
theoretical γ for terms with frequency of use of 3 co-occurring with
terms with frequency of use of 1 to 3
Fig. 5.32 -- Cumulative frequency of the ratio of actual γ to
theoretical γ for terms with frequency of use of 4 co-occurring with
terms with frequency of use of 1 to 4
Fig. 5.33 -- Cumulative frequency of the ratio of actual γ to
theoretical γ for terms with frequency of use of 5 co-occurring with
terms with frequency of use of 1 to 5
Fig. 5.34 -- Cumulative frequency of the ratio of actual γ to
theoretical γ for terms with frequency of use of 10 co-occurring with
terms with frequency of use of 1 to 10
Fig. 5.35 -- Cumulative frequency of the ratio of actual γ to
theoretical γ for terms with frequency of use of 15 co-occurring with
terms with frequency of use of 1 to 15
Fig. 5.36 -- Cumulative frequency of the ratio of actual γ to
theoretical γ for terms with frequency of use of 20 co-occurring with
terms with frequency of use of 1 to 10
Fig. 5.37 -- Cumulative frequency of the ratio of actual γ to
theoretical γ for terms with frequency of use of 32 co-occurring with
terms with frequency of use of 1 to 10

5.4.1 Application of the Term Co-Occurrence Factor,
γ, and Determination of Rq

Step (2) involves the application of γ to the explicit conjunctive
arguments, and implicit intersections of the disjunctive arguments, in
the inquiries. Taking a simple example such as T1·T2, for which Rq is
the term co-occurrence value (TXT(1,2)), the lower bound estimate of Rq
is:

$$R_q = \gamma\,\frac{f(1)\,f(2)}{D}$$

γ is found by using the appropriate plot of γ and the term frequencies
of use (e.g., plots like Figs. 5.31 to 5.37), and the variables f(1),
f(2) and D are readily determined for any operational system.

A few examples will help to illustrate the estimation procedure:

1. Request: Retrieve all those documents that discuss the concept
   of Coordinate Indexing

   Formal Inquiry: Concept and Coordinate Indexing

   From Appendix C, the frequencies of use of each inquiry term, in
   the sample data, are:

   f(concept) = 7
   f(Coordinate Index) = 10
   D = 102

   From Fig. 5.20, for f(i) = 10, γ = 1.46, and

   $$R_q = (1.46)\left(\frac{7 \cdot 10}{102}\right) = 1$$

   which is exactly correct for the sample data base.

2. Request: Retrieve all the documents that discuss classification
   and clumping

   Formal Inquiry: Classification and clump

   From Appendix C, the frequencies of use for each term are:

   f(classification) = 20
   f(clump) = 5

   From Fig. 5.28, for f(i) = 20, γ = 1.02 and the theoretical lower
   bound estimate of Rq is:

   $$R_q = (1.02)\left(\frac{20 \cdot 5}{102}\right) = 1$$

   which is less than the actual number (3) of documents described by
   the two terms in the sample data base.

These examples show that for combinations of low frequency of use
terms the lower bound theoretical γ-factor leads to accurate Rq
estimates, but tends to diverge from TXT(i,j) as f(i) and/or f(j)
increases. However, when the lower bound γ value causes the Rq estimate
to be less than the actual value, the difference or correction is
always an integer multiple of γ.

One way to correct for the underestimation for large f(i) is to
employ a simple weighting scheme; that is, to apply weights
(probabilities) to integer multiples of the lower-bound γ, with the
weights reflecting the proportion or frequency of occurrence of the
values of TXT(i,j) for the terms i and j of interest. For example, the
estimate for an n-term conjunction would be

$$R_q = \alpha_1\,\gamma\!\left(\frac{f(i)\,f(j)}{D}\right) + \alpha_2\,2\gamma(\cdot) + \cdots + \alpha_n\,n\gamma(\cdot)$$

where n = [f(i), f(j)]_minimum, and the α's can be estimated from plots
of the cumulative frequency of the ratio of actual to theoretical γ's,
as in Figs. 5.31 to 5.37, or from the cumulative distribution of the
values of term co-occurrences, such as in Fig. 5.38, or the density
distribution of the values of the term co-occurrence, as in Figs. 5.39
and 5.40.

For example 2 above, the corrections are determined from Fig.
5.40, for f(i) = 20 and f(j) = 5:

α_1 = 0.51
α_2 = 0.21
α_3 = 0.10
α_4 = 0.06
α_5 = 0.04

The Rq estimate is now:

$$R_q = \left[\,1 + (.51)(1) + (.21)(2) + (.1)(3) + (.06)(4) + (.04)(5)\,\right]\,1.02(\cdot) = 2.72$$

which is a much better estimate of the actual value of 3. The
distribution of α's is quite stable, and in Section 5.4.2 they are
incorporated into the γ versus f(i) plot (see Fig. 5.42).

Unlike the above simple examples, most requests are a string of
conjunctively and disjunctively related terms, and in general the string
will contain more than two terms. When more than two terms are
included in an inquiry, the estimation of Rq requires an iterative
procedure. For example, consider the following inquiry:

T1 and T2 and T3 and ... and Tn

To estimate Rq one must:

(1) determine f(T1), ..., f(Tn)

(2) determine γ for f(T1) and f(T2), and the theoretical value
    of TXT(T1,T2) by γ·(f(T1)·f(T2)/D); this value is the
    intersect of T1 and T2

(3) call the intersect of T1 and T2, T1', and determine the
    intersect of T1' and T3, as per step (2)

(4) repeat steps (2) and (3) until the intersect of T'(n-1) and
    Tn is determined; this is the Rq estimate for the n-term
    conjunctive series

Fig. 5.38 -- Cumulative frequency of occurrence of TXT(i,j) for terms
with f(i) = 1, 2, 3, 5, 10, 15, 20, 26 and 32 and 1 ≤ f(j) ≤ 32; only
non-zero co-occurrences plotted
Fig. 5.39 -- Density distribution of occurrence of TXT(i,j) for terms
with f(i) = 1, 2, 3, 5, 10 and 15, and 1 ≤ f(j) ≤ 32; only non-zero
TXT(i,j) plotted
Fig. 5.40 -- Density distribution of occurrence of TXT(i,j) for terms
with f(i) = 20, 26 and 32, and 1 ≤ f(j) ≤ 32; only for non-zero
TXT(i,j)
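
The iterative procedure for a conjunctive series can be sketched as follows. This is a hedged illustration, not the author's implementation: the callable `gamma` stands in for reading the γ factor off the plots for a given pair of frequencies and is a hypothetical lookup, and the name `estimate_conjunction` is likewise an assumption.

```python
def estimate_conjunction(freqs, D, gamma):
    """Estimate Rq for T1 and T2 and ... and Tn.

    freqs -- frequencies of use f(T1), ..., f(Tn)
    D     -- number of documents in the corpus
    gamma -- callable gamma(f_a, f_b) returning the co-occurrence factor
             (hypothetical stand-in for the plotted gamma curves)
    """
    if not freqs:
        raise ValueError("at least one term frequency is required")
    net = float(freqs[0])                  # running "net" term, starts as f(T1)
    for f in freqs[1:]:
        # intersect of the running net term with the next term, as in step (2)
        net = gamma(net, f) * net * f / D
    return net
```

For instance, with a constant factor of 2.0 supplied purely for illustration, `estimate_conjunction([20, 10, 5], 100, lambda a, b: 2.0)` first reduces the pair (20, 10) to a net value of 4 and then intersects that with 5.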

In the event that a request contains one or more disjunctions,
the above iterative procedure is modified as follows. Consider an
inquiry of the form:

(T1 or T2) AND (T3 or T4)

To estimate Rq, recall that

$$R_q(T_1 + T_2) = f(T_1) + f(T_2) - \mathrm{TXT}(T_1, T_2)$$

and incorporate this relationship in the iterative procedure:

(1) determine f(i), for i = T1, T2, T3 and T4

(2) determine γ for f(T1) and f(T2), and the theoretical value of
    TXT(T1,T2) by γ·(f(T1)·f(T2)/D); this is the intersect of T1
    and T2, and therefore

    $$R_q(T_1 + T_2) = f(T_1) + f(T_2) - \gamma\,\frac{f(T_1)\,f(T_2)}{D}$$

(3) repeat step (2) for all other disjunctive pairs.

(4) when all the disjunctive groups have been reduced to their
    "net" respective T's, the remaining expression is simply a
    conjunctive series and the estimate is determined as for
    the previous example.

At times an inquiry will contain an explicit negation of a term,
such as in the following example:

T1 AND NOT T2

To estimate Rq, an additional modification of the above procedure is
required. Recall that

$$R_q(T_1 - T_2) = f(T_1) - \mathrm{TXT}(1,2)$$

yields the net Rq. Therefore, for those clauses in which there is
a negated term, the above relationship is determined, and the resulting
net T is used to compute the remaining conjunctions and/or
disjunctions of terms.
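
The disjunction and negation rules can be sketched with three small helpers. As before, this is only an illustration under stated assumptions: `gamma` is a hypothetical callable standing in for the plotted γ factors, and the function names are not from the source.

```python
def intersect(f_a, f_b, D, gamma):
    """Theoretical TXT value: gamma * f(a) * f(b) / D."""
    return gamma(f_a, f_b) * f_a * f_b / D


def union(f_a, f_b, D, gamma):
    """Net quantity for (Ta or Tb): f(a) + f(b) - TXT(a, b)."""
    return f_a + f_b - intersect(f_a, f_b, D, gamma)


def and_not(f_a, f_b, D, gamma):
    """Net quantity for (Ta AND NOT Tb): f(a) - TXT(a, b)."""
    return f_a - intersect(f_a, f_b, D, gamma)
```

An inquiry of the form (T1 or T2) AND (T3 or T4) is then estimated by reducing each disjunctive group with `union` and passing the two net values to `intersect`.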

Having established an iterative procedure to estimate quantity
output for complex inquiries, the next step is the evaluation of the
Rq estimation process.

5.4.2 Testing the Rq Estimate 

The data and illustrations presented thus far reflect the sample
data, and it is necessary to extend the findings to the test system
to evaluate the estimate. In order to do this, certain logical
properties of the relationship

$$\mathrm{TXT}(i,j) = \gamma\,\frac{f(i)\,f(j)}{D}$$

must be established.

As demonstrated, the above relationship is linear with slope of -1
in log-log space.* Further, for any DRS, all curves for any
combination of f(i) and f(j) are derivable from the theoretical curve
for f(i) = 1 and 1 ≤ f(j) ≤ D. To show this, the first step is to
determine the intercept for the curve f(i) = 1 and 1 ≤ f(j) ≤ D.

The ordinal intercept for f(i) = 1 and 1 ≤ f(j) ≤ D is defined at
f(i) = 1 and f(j) = 1, which yields

$$\mathrm{TXT}(i,j) = 1 = \gamma\,\frac{1 \cdot 1}{D}$$

or

$$\gamma = D$$

which is the value of the intercept on the γ-axis. The intercept on
the f(j) axis for the curve f(i) = 1, 1 ≤ f(j) ≤ D can be determined
in a similar manner. Setting f(i) = 1 and f(j) = D yields

$$\mathrm{TXT}(i,j) = 1 = \gamma\,\frac{1 \cdot D}{D}$$

or

$$\gamma = 1.$$

*For the theoretical lower bound.

Therefore, all one needs to know to establish the value of the
intercepts for the curve f(i) = 1 and 1 ≤ f(j) ≤ D is the size D of
the system corpus, and that the term usage versus rank distribution is
approximated by the MEZ canonical form.
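
The two intercepts suggest a simple closed form for the straight-line part of the theoretical lower bound. The sketch below assumes that the lower bound corresponds to a single co-occurrence, TXT(i,j) = 1, so that γ = D / (f(i)·f(j)); this reading is an inference from the intercept derivation and from the two worked examples, not a formula stated in the text, and the function name is hypothetical. It does not model the asymptote to γ = 1 discussed for high frequencies.

```python
def gamma_lower_bound(f_i, f_j, D):
    """Gamma for which gamma * f(i) * f(j) / D equals exactly one co-occurrence.

    Assumes the theoretical lower bound is the TXT(i, j) = 1 line, which has
    slope -1 in log-log space: Log gamma = Log D - Log f(i) - Log f(j).
    """
    return D / (f_i * f_j)
```

Under this assumption the intercepts are reproduced (γ = D at f(i) = f(j) = 1, and γ = 1 at f(i) = 1, f(j) = D), and the sample data base with D = 102 gives values that round to the γ = 1.46 and γ = 1.02 used in the two examples of Section 5.4.1.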

The curve just determined is the lower and upper bound for all
values of γ for terms with f(i) = 1 and 1 ≤ f(j) ≤ D. In addition,
this curve is the upper bound on the values of γ for all other
combinations of term frequency; that is, for

1 < f(i) ≤ D
1 ≤ f(j) ≤ D

Further, on the basis of the above curve for f(i) = 1 and 1 ≤ f(j) ≤ D,
the theoretical lower bound values of γ for all other combinations of
f(i) and f(j) can be determined. The procedure to determine these
lower bound curves is illustrated in Fig. 5.41, for the test corpus
with D = 416 and f(j) = 10, and consists of the following steps:

(1) locate f(j) = 10 on the abscissa (point I in Fig. 5.41).

(2) follow the vertical line up to the intersection (point II)
    with the line for f(i) = 1, 1 ≤ f(j) ≤ 416.

(3) follow the horizontal to the ordinate intercept (point III),
    which gives the value of γ for f(i) = 1 and f(j) = 10.

(4) trace the 45° line, with slope -1, to its intercept with
    the abscissa, at γ = 1 (point IV).

The resulting line between points III and IV, and extrapolated
beyond, is the theoretical lower bound for γ for f(j) = 10 and
1 ≤ f(i) ≤ D', where D' < 416 - f(i). That is, if one were to estimate
the intersect, TXT(i,j), of two terms with f(i) = 10 and f(j) = 416,
respectively, it is clear that TXT(i,j) = 10, by definition.
Therefore, in order that the theoretical lower bound curve satisfy that
condition, it must be asymptotic to the line for γ = 1 and intercept
that line in the vicinity of f(j) = 416.

Fig. 5.41 -- γ factor versus f(i) for ILR document retrieval system,
stage II

Given the basis for constructing the theoretical envelope, and its
bounds, of γ values, the next step is to determine the best estimate
values of the γ factor between the upper and lower bounds, for the test
system. The best estimate values of γ can be determined using the
following assumptions about, and properties of, coordinate index DRSs.

(1) the sample data base is representative of the parent or test
    system, and the divergence data indicated in Figs. 5.31 to
    5.40 can be extrapolated to the value of f(i)_max in the test
    system.

(2) the upper bound of the γ-curves for any term is defined by
    the curve of slope (-1) for f(i) = 1 and 1 ≤ f(j) ≤ D, in
    log-log space.

(3) the lower bound of the γ-curve for any term j is defined
    by the curve with an ordinal intercept defined by the
    intersection of f(j) with the curve for f(i) = 1, 1 ≤ f(j) ≤ D,
    and an asymptote to γ = 1 in the vicinity of D', where D' =
    D - f(j).

(4) for any two terms, the value of the γ factor must be the same,
    regardless of the sequence of determination; that is, the
    curves must possess a symmetry such that

    $$\gamma_{f(i),f(j)} = \gamma_{f(j),f(i)}$$

    This property follows from the fact that TXT is symmetric;
    i.e., TXT(i,j) = TXT(j,i).

Using the above assumptions and properties, the best estimate γ
factor curves for the test system were derived and are presented in
Fig. 5.42. From property (3), one would expect that the test system
γ curves would be asymptotic to the line γ = 1 for f(i) = D.
However, there is instead an apparent convergence of curves at
3 ≤ γ ≤ 5, for high f(i). Since the data sample had very few points in
the range of 30 < f(i) < D, it was not possible to analyze this
characteristic in depth. However, it is likely that the reason for this
property is that the test system is small (D ≈ 400 and T ≈ 400) and,
as the product of f(i)·f(j) approaches or exceeds D, the intersection
of the two terms is going to be substantial, and hence the convergence
of γ-curves for high f(i) (but « D) at γ > 1.

In order to evaluate the estimation process, based on the γ-curves
in Fig. 5.42, a set of 15 requests of various content was generated.
The requests are considered to be typical and corpus subject
related, and are not based on the descriptions of any one document
or set of documents.* The test inquiries are listed in Table 5.6.

The Rq values, both estimated and actual, for each inquiry were
determined for direct match searches and are reported in Table 5.7.**
The estimated values are, for all inquiries, very close to the
actual Rq, and clearly demonstrate that the Retrieval Quantity for an
operational coordinate index DRS can be accurately predicted for formal
inquiries.

*The intent was to avoid the early Cranfield (see Ref. 130) or
"Moore's" type inquiry, in which requests are generated from document
descriptor sets. Such inquiries test the system retrieval search
linkages, but are certainly not representative of the typical user
request.

**Some sample computations are included in Appendix D.

Table 5.6
TEST INQUIRIES

 1. Auto. indexing and auto. abstracting and (theory or analysis
    or experiment) and not manual indexing

 2. Comp. linguistics and syntax and semantic
    Form: T1 · T2 · T3

 3. Natural language and (auto. indexing or auto. abstracting)
    and experiments
    Form: T1 · (T2 + T3) · T4

 4. Stat. association and (clump or cluster) and experiment
    Form: T1 · (T2 + T3) · T4

 5. Automatic and indexing and (coordinate or subject heading)
    Form: T1 · T2 · (T3 + T4)

 6. Measure and relevance and evaluation and (theory or
    performance)
    Form: T1 · T2 · T3 · (T4 + T5)

 7. Simulation and (retrieval or info. retrieval or document)
    Form: T1 · (T2 + T3 + T4)

 8. Theory and (documentation or info. retrieval)
    Form: T1 · (T2 + T3)
    Term frequency: f(T1) = 20, f(T2) = 10, f(T3) = 84

 9. Design and retrieval system and (on-line or real-time)
    Form: T1 · T2 · (T3 + T4)
    Term frequency: f(T1) = 9, f(T2) = 15, f(T3) = 3, f(T4) = 1

10. Design and automatic and retrieval system
    Form: T1 · T2 · T3
    Term frequency: f(T1) = 9, f(T2) = 28, f(T3) = 15

11. Computer and education and (design or evaluation)
    Form: T1 · T2 · (T3 + T4)
    Term frequency: f(T1) = 69, f(T2) = 15, f(T3) = 9, f(T4) = 44

12. Question and evaluation and (Boolean or logical)
    Form: T1 · T2 · (T3 + T4)
    Term frequency: f(T1) = 33, f(T2) = 44, f(T3) = 13, f(T4) = 4

13. Depth-of-indexing and (evaluation or analysis)
    Form: T1 · (T2 + T3)
    Term frequency: f(T1) = 8, f(T2) = 44, f(T3) = 53

14. Natural language and translation
    Form: T1 · T2
    Term frequency: f(T1) = 38, f(T2) = 31

15. Abstracting and centers and controlled
    Form: T1 · T2 · T3
    Term frequency: f(T1) = 13, f(T2) = 5


Table 5.7
COMPARISON OF ACTUAL AND ESTIMATED Rq FOR DIRECT MATCH SEARCHES

Inquiry    Rq-Actual    Rq-Estimate
   1           2          1-2 (a)
   2           2          1-2
   3           2          3-4
   4           0          1-2
   5           2          4
   6           0          3
   7           2          3
   8          13          15
   9           9          0-1
  10           1          1-2
  11           3          4
  12           1          2-3
  13           6          5-6
  14          12          12
  15           1          0-1

(a) The Rq estimate is frequently a non-integer value and the ranges
indicated are integer bounds.



5.5 THE LIKELIHOOD OF NON-ZERO TERM-TERM CO-OCCURRENCES 

The analysis and results presented thus far have implicitly assumed
that the probability of term-term co-occurrences for terms with f(i),
f(j) > 0 (for actual inquiry combinations for a homogeneous corpus) is
significantly greater than zero. Thus the γ factors presented in Fig.
5.42 can be viewed as the values to estimate TXT(i,j), given that
f(i), f(j) > 0 and that terms i and j do indeed co-occur. Since the
DXT matrix is usually very sparse (for the test data sample
approximately 95 percent of the cells are zero), and the TXT matrix is
also usually sparse* (for the test data sample, approximately 82
percent of the cells are zero), some insight into the behavior of

P(TXT(i,j) | f(i), f(j) > 0)

as a function of f(i), f(j), and the number of terms with the same
frequency of use is desired.

*It can be shown that the sparsity of TXT is always less than or
equal to the sparsity of DXT, where TXT = (DXT)^T (DXT) and
DXT(i,j) ≥ 0 for all i, j.

The theoretical probability, based on independent term usage, that
the co-occurrence of two terms is greater than zero, given that each
term has a frequency of use greater than zero, can be determined as
follows:

Given: D documents = {d}
       T terms (active) = {t}
       i_t = the frequency of use of term t; 1 ≤ i_t ≤ D
       j_i = the number of terms with frequency of use i; 1 ≤ j_i ≤ D
             (e.g., j_3 = the number of terms with i_t = 3)

where:

$$\sum_{m=1}^{k} j_m = T$$

$$\sum_{t=1}^{T} i_t = \sum_{m=1}^{k} m\, j_m = N$$

N = the total term frequency of occurrences

For this analysis, one may specify an initial distribution for
{j_i} and then, for all the terms {t}, select i_t documents at random
and without replacement and use the term to describe the documents.

For computational convenience the probability of non-occurrence,
P(TXT(t_a,t_b) = 0), will be determined, and then P(TXT(t_a,t_b) > 0) =
1 - P(·). A general condition on P is that:**

P = 0 for i_ta + i_tb > D

For the case in which i_ta + i_tb ≤ D, the simplest situation is where
only one term is used i_ta times and only one term i_tb times; that is,
j_(i_ta) = j_(i_tb) = 1. For notational convenience, let

i_ta = x, i_tb = y

Q_xy(0) = P(TXT(t_a,t_b) = 0)

**This constraint is necessary because any one term can be assigned
to any one document only once.



For this case:

$$Q_{xy}(0) = \frac{D-x}{D}\cdot\frac{D-x-1}{D-1}\cdots\frac{D-x-y+1}{D-y+1} = \frac{(D-x)!\,(D-y)!}{(D-x-y)!\,D!}$$

However, the more general condition is when there is at least one term
that is used i_ta times and at least one term that is used i_tb times;
that is, j_x ≥ 1 and j_y ≥ 1 and j_x ≠ j_y.

Let X = the number of documents described by at least one of
        the j_x terms with frequency of use x
    Y = the number of documents described by at least one of
        the j_y terms with frequency of use y

Given X and Y, for those terms with the same frequency of occurrence,
the probability that there are no co-occurrences, Q_XY(0), of these
terms is exactly the probability Q_xy(0) defined above; that is, the
same expression with X and Y in place of x and y.

When the specific numbers X and Y are not known, the values of P(X)
and P(Y) must be determined. Under these conditions, the probability
that there are no co-occurrences is defined as

$$Q(0) = \sum_{X=x}\;\sum_{Y=y} P(X)\,P(Y)\,Q_{XY}(0)$$

where P(X) = probability that X documents are described by those j_x
terms with frequency of use x.

A special case of the above general relationship is the probability
of no co-occurrence among the terms with frequency x, where
for x·j_x < D,

(1) the number of ways the event no co-occurrence can occur equals

and

(2) the number of possible events in the space D with j_x terms,
    with frequency of use x, mapped onto that space equals

Thus the probability of no co-occurrence for terms j_x with the same
frequency of occurrence is:

III.
Of particular interest are the lower bound conditions or
probabilities that describe the co-occurrence of terms with f(i)_min =
i_ta = 1; that is, terms with frequency of use of one. This probability
can be viewed as the threshold case because, as shown in previous
sections, the co-occurrence of terms i and j with f(i) and f(j) > 1 is
always greater than or equal to the f(i)_min = 1 case.

A plot of the theoretical probability of at least one co-occurrence
for terms with f(i) or x = 1 with varying values of j_x (1 ≤ j_x ≤ D)
is presented in Fig. 5.43. In the range of j_x = 12 it is as likely
to have a co-occurrence as not, for the theoretical distribution, and
for any values of j_x > 12 the likelihood of at least one
co-occurrence is very high. The probability of co-occurrence for the
actual test data is, for the few points computed, greater than or
equal to the theoretical case. As such, Fig. 5.43 affords a convenient
lower bound estimation on the probability of at least one
co-occurrence for terms with frequency of use of 1 as a function of
the number of such terms.

Fig. 5.43 -- Theoretical probability P = 1 - P(TXT(i,j) = 0) versus
1 ≤ j_x ≤ 80, for f(i) = f(j) = 1 (abscissa: j_x, the number of terms
with the same frequency of use)

Operationally, this means for the test sample where x = f(i)_min = 1
that one is better off assuming that a co-occurrence exists than not,
and at worst the Rq estimate will be off by one in a few cases.

A sample of the term-term co-occurrence for the test data Is 
tabulated In Table 5.8. The columns are labeled In terms of the vari- 
ables noted In Eq. II. 



Table 5.8

TERM-TERM CO-OCCURRENCES BETWEEN TERMS
WITH DIFFERENT FREQUENCY OF USE

   x                      ΣTXT(t_i, t_j)

  80      1      80            83
   1      2      36            61
          3      44            68
          4      39            91
          5      24            76
          6      17            ?4
          7      10            ?7
          8      13            ?6
          9       6            18
         10       4            14
         11       3            19
         12       6            47
         13       4            23
         14       3             8
         15       3            18
         16       1            10
         17       2            12
         20       2            15
         21       2            29
         22       2            24
         24       1            15
         26       2            10
         27       1             6
         32       1            10

No terms in the test data were used for f(i) = 18, 19, 23,
25, 28, 29, 30, 31.
5.6 WORD ASSOCIATION COEFFICIENTS

The relationship between the elements in the TXT matrix and the
prediction function, γ(f(i)·f(j)/D), is based on the assumption that
descriptors are assigned to documents in a binary manner. That is, a
term is or is not assigned as a descriptor, or in other words, the
term assignment weights are 0 and 1.

In many instances, there is a need to elaborate upon an inquiry
so that additional documents can be retrieved. A common technique to
accomplish inquiry expansion is through word association; that is, by
disjunctively incorporating new terms with those terms in the inquiry
with which they are highly correlated/associated. By necessity, these
correlation relationships have non-integer values, and are derived from
the TXT distribution.

In the Institute of Library Research DRS, a coefficient of
association is determined for all co-occurring index terms. For
purposes of processing convenience, only the four highest correlating
terms are retained as association words for the base term. In the
event that an inquiry is to be expanded, a disjunct is formed with the
original term and its four most highly correlated terms. In general,
the associated set of terms will be different for each index term, and
the members of the set of associated terms can be different for any
one term depending on the word association measure used.

It can be shown that the term co-occurrence factor γ can also be
used to estimate word association coefficients. Following Kuhns (81),
the form of a general class of coefficients of association is defined
to be:

$$C_\alpha(i,j) = \frac{\delta(i,j)}{\alpha}$$

where

$$\delta(i,j) = \mathrm{TXT}(i,j) - \frac{f(i)\,f(j)}{D}$$

A sample of the set of candidate expressions for α is listed in
Table 5.9. For the derivation and rationale of these forms, and their
application, see Kuhns (81) and Maron, et al. (98), respectively.

As noted earlier,

$$\mathrm{TXT}(i,j) = \gamma\,\frac{f(i)\,f(j)}{D}$$

and substituting into C_α(i,j) yields

$$C'_\alpha(i,j) = \frac{(\gamma - 1)\,f(i)\,f(j)}{\alpha D}$$

Therefore, one can estimate the coefficient of association for any
two terms knowing the γ factor for the DRS.
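
As a rough illustration of this estimate, the sketch below computes
the estimated numerator (γ - 1)·f(i)·f(j)/D and divides it by two of
the Table 5.9 parameters, Min(f(i), f(j)) for the W measure and
Max(f(i), f(j)) for the R measure. The function names and the sample
values of γ, f(i), f(j) and D are hypothetical; only the formula is
taken from the text.

```python
def estimated_delta(gamma: float, f_i: int, f_j: int, corpus_size: int) -> float:
    """Estimated numerator: (gamma - 1) * f(i) * f(j) / D."""
    return (gamma - 1.0) * f_i * f_j / corpus_size


def estimated_association(gamma: float, f_i: int, f_j: int,
                          corpus_size: int, measure: str = "W") -> float:
    """Estimated coefficient of association for the W or R measure."""
    denominators = {
        "W": float(min(f_i, f_j)),  # conditional probability on weak evidence
        "R": float(max(f_i, f_j)),  # rectangular distance between the terms
    }
    return estimated_delta(gamma, f_i, f_j, corpus_size) / denominators[measure]


if __name__ == "__main__":
    # Hypothetical values: gamma = 3, f(i) = 10, f(j) = 20, D = 1000.
    print(estimated_association(3.0, 10, 20, 1000, "W"))
    print(estimated_association(3.0, 10, 20, 1000, "R"))
```

When γ = 1 the estimated numerator is zero, so every coefficient of
this class is estimated as zero.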

5.7 SYSTEM GROWTH IMPACT ON RETRIEVAL QUANTITY 

All operational DRSs must sustain changes in corpus collection
and content, and thesaurus size in order to remain useful over time.
However, as the corpus and thesaurus change, particularly in size,
the performance of the DRS also changes; for the same inquiry it is
very possible to get different output sets from a DRS at different
points in time.

In order to demonstrate the sensitivity of quantity output to
changes in the system corpus and thesaurus for different search
strategies, an experiment was performed on the ILR DRS over different
stages of its development. The comparative performance of the test
Document Retrieval System between stage 1 and 2 is based upon a set of
common questions and three word association files and direct match
searches.

Table 5.9

COEFFICIENTS OF ASSOCIATION PARAMETER α (81)

Symbol   Parameter α         Description of Parameter

S        D/2                 Measure of the separation or
                             "distance between the terms"

G        [not legible]       Measure of the angle between the
                             vectors representing the terms

W        Min(f(i), f(j))     Measure of the conditional probability
                             on weak evidence

R        Max(f(i), f(j))     Measure of rectangular distance
                             between the terms

P        [not legible]       Measure of the proportion overlap
                             between the terms

L        [not legible]       Measure of the linear correlation

From the tabulated data in Table 5.10 and the plot of the measure
of coefficient of word association G in Fig. 5.44, the dynamic
property of the coefficients of association can be seen. In all
cases, the S-measure produced less output as the corpus and thesaurus
increased in size from stage 1 to stage 2. This is a result of both
the measure and the laboratory search routine. That is, the
denominator of the measure is directly proportional to any increase
in corpus size, hence making the measure smaller with increasing
corpus size, as the numerator increases at a much slower rate. The
laboratory search routine employed also contributes to this decrease
in output in that it has a default relevance threshold condition that
ignores any documents that do not have a relevance value to the
query measurable in the first three significant digits. Hence any
document without a relevance measure in the first three significant
digits will not be retrieved.

On the other hand, the measure G provided an increase in output
for all questions from stage 1 to stage 2. The W-measure provided
no increase for two cases, and a slightly larger set for two cases.

It is interesting to note that the intersection of the output
sets (see Table 5.10) is surprisingly small, for the same measure
and same question for the two stages. Clearly, some documents that
the system attributed as being relevant to a question in stage 1 are
not being retrieved in stage 2. The cause of this difference in
content of the output sets is a characteristic of the sensitivity of
the different measures to system growth.

[Fig. 5.44: plot not reproduced. Vertical axis: quantity of document
output.]

Table 5.10

QUANTITY OUTPUT FOR STAGE 1 AND STAGE 2

                                          Cardinal Measure
          Coeff. of Assoc.
Inquiry   Measure          Stage    Output Set   Intersection   Union

   1      S                  1           3             1           3
                             2           2
          G                  1           2             2          11
                             2          11
          W                  1           2             1           3
                             2           2
          Direct Match       1           1             1           1
                             2           1

   2      S                  1           4             1           5
                             2           2
          G                  1          17            15          34
                             2          32
          W                  1           8             8          14
                             2          14
          Direct Match       1           2             2           2
                             2           2

   3      S                  1          16             3          24
                             2          11
          G                  1           1             1          23
                             2          23
          W                  1           2             2           2
                             2           2
          Direct Match       1           2             2           2
                             2           2

   4      S                  1           4             0           4
                             2           0
          G                  1           1             0          14
                             2          13
          W                  1           0             0           3
                             2           3
          Direct Match       1           0             0           0
                             2           0
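
The cardinal measures of Table 5.10 are simple set operations on the
two output sets of one inquiry. The sketch below shows that
computation; the function name and the document identifiers are
hypothetical, chosen so that the counts match the inquiry 1, G-measure
row (2 and 11 documents, intersection 2, union 11).

```python
def compare_output_sets(stage1_docs, stage2_docs):
    """Return the sizes of two output sets, their intersection and their union."""
    s1, s2 = set(stage1_docs), set(stage2_docs)
    return {
        "output_1": len(s1),
        "output_2": len(s2),
        "intersection": len(s1 & s2),  # documents retrieved at both stages
        "union": len(s1 | s2),         # documents retrieved at either stage
    }


if __name__ == "__main__":
    stage1 = [101, 102]                          # hypothetical document ids
    stage2 = [101, 102] + list(range(201, 210))  # nine further documents
    print(compare_output_sets(stage1, stage2))
```

A small intersection relative to the stage 1 output means that
documents retrieved at stage 1 are no longer retrieved at stage 2.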

The experiment does show that the change in output performance
with system growth is certainly non-linear (see Fig. 5.44). And,
further, if one ignores the S-measure it can be seen from the G- and
W-measures, and by examination of denominators of some of the other
candidate measures in Table 5.9, that the output set will always be
as large and, in the majority of cases, much larger for the same
question as the system grows.
Chapter 6

CONCLUSION AND SYNTHESIS OF FINDINGS 

6.1 INTRODUCTION 

The purpose of the analyses in the previous chapters is to pro- 
vide a basis for the development of management and design aids for 
DRSs, through the investigation of fundamental relationships between 
the components of DRSs. 

The objective of this chapter is to summarize and synthesize 
those findings and to discuss their implications for DRS management 
and design. 

6.2 GENERAL CONCLUSIONS 

On the basis of the experiments and analysis reported in Chapter 
5, it is concluded that retrieval quantity can be predicted, and that 
the underlying characteristics which permit the estimation have 
potential as DRS management and design aids. 

To briefly review, the findings made are believed to hold for 
a wide range of DRSs, such as: 

    Corpus size: 100 to 50,000

    Thesaurus size: 300 to 13,000

    Term Frequency of Use: 1 to 4,200

They are based on the detailed analysis of a representative sample
DRS from this range, and consist of the following:

(1) The MEZ canonical form of $f(r) = K(r+B)^{-\alpha}$ characterizes
    the term-frequency-of-use versus term rank distribution for a
    wide range of manipulative index DRSs. The parameters K, B
    and α are estimated as a function of corpus size, thesaurus
    size and depth of indexing.

(2) Term-term co-occurrences are not generated by random sampling
    from the thesaurus.

(3) The value of term-term co-occurrences is directly proportional
    to the function of the product of the frequencies of use of
    the terms, and can be predicted by the relationship

    $$\mathrm{TXT}(i,j) = \gamma\,\frac{f(i)\,f(j)}{D}$$

    where γ is defined as a function of term frequency of use
    and corpus size.

(4) The Retrieval Quantity of a formal inquiry can be accurately
    predicted as a function of γ, term frequency of use and
    corpus size.

(5) For the class of coefficients of association of the form
    (see Kuhns (81))

    $$C_\alpha(i,j) = \frac{\delta(i,j)}{\alpha}$$

    the numerator,

    $$\delta(i,j) = \mathrm{TXT}(i,j) - \frac{f(i)\,f(j)}{D}$$

    can be estimated by

    $$\delta'(i,j) = (\gamma - 1)\,\frac{f(i)\,f(j)}{D}$$

(6) The probability that two terms, with frequencies of use
    greater than or equal to one, will co-occur is definable by
    an ordered family of curves with an upper and lower bound
    as indicated in Fig. 6.1. Each curve is a function of the
    frequencies of use of the two terms, the number of terms
    with the same frequency of use, and the size of the corpus.

(7) Terms with the same frequency of occurrence have similar
    DRS statistical properties; that is, the distributions of
    the number and value of their term-term co-occurrences are
    approximately the same.

(8) The impact of DRS corpus and thesaurus growth on retrieval
    quantity can be predicted.

6.3 MANAGEMENT AND DESIGN AIDS 

The management of a DRS entails cost/benefit analysis of system
operations and plans, measuring system performance for different
tasks, and controlling the system processes. It is not the intent
to delve into a discourse on DRS performance evaluation, but rather
to describe how the findings (summarized above) can be used to aid
in some aspects of DRS management and design.

[Figure not reproduced. Horizontal axis: Jx, number of terms with the
same frequency of use. Curve label: "f(j) = D, Theoretical upper ..."]

Fig. 6.1--Theoretical family of curves defining the lower bound
of the probability of co-occurrence of two terms with
f(i) = 1, 1 < f(j) < D, and 1 ≤ Jx ≤ D, γ = 1

(1) Tuning Inquiries. By estimating Rq for an initial inquiry,
    the grammatical combinations and/or number of terms can be
    modified to yield different expected Rq's. Through this
    pre-processing exercise the DRS user can adjust inquiries to
    retrieve a more preferred quantity of references. In this
    way the marginal effect on quantity output of adding or
    deleting a term of a certain frequency of use, and creating
    different logical combinations, can be estimated.

    By employing such a "tuning" measure it is quite likely
    that the DRS users will find the system more understandable
    and convenient, and management can reduce the potential
    number of user disappointments in system responses.
(2) Predicting and Monitoring the Impact of System Growth. As
    the system corpus and thesaurus change over time, both the
    quality and quantity of the system output will also change,
    for a constant set of inquiries. The Rq measure can be
    used to estimate the impact of corpus and thesaurus change
    on the system output quantity. The most straightforward
    application is to determine the set of γ factors for an
    operational DRS with a specified corpus and thesaurus size,
    and then, if D is increased, to project a proportionate
    increase in the γ-factor bounds. The new γs can be used to
    estimate the changes in Rq for a specific inquiry. Using
    the Rq measure in this way provides some insight into the
    dynamic characteristics of DRSs.

    One could also use the Rq measure to estimate the impact on
    output quantity due to changes in the thesaurus with the
    corpus held constant. In this process, the frequency of use
    of the thesaurus terms would be changed, and/or new terms
    added. The bounds of the γ-factor would remain the same,
    but the likely value of γ for high f(i) would change, and
    the technique for estimating the new γs is directly
    analogous to that used in Section 5.64 to illustrate the
    P(TXT(i,j) > 0) distribution.
(3) Indexing Process Modification. There are various controls
    that can be imposed on the indexing process, and the Rq
    measure can be used to estimate the effect of changes in
    control limits on the quantity output. For example, a manager
    or designer may want to:

    a. Truncate the index term frequency of use distribution
       by specifying f(i)_min and/or f(i)_max limits.
       The impact, on quantity outputs, of changing the
       values of f(i)_min/max can be estimated by computing
       Rq at the different values, for a set of typical
       inquiries.

    b. Limit the minimum or maximum number of terms that
       can be used to describe any one document. An
       interesting condition to investigate is to alter the
       "current" depth of indexing, Dc, lower and upper
       bounds so as to gradually approach a uniform
       distribution in which Dc_min = Dc_max. The sensitivity
       of the quantity output to the rate of change of the
       depth of indexing distribution can be estimated by
       the Rq measure, because the frequency of term use,
       f(i), distribution is indirectly altered and Rq is
       a function of the values of f(i).

    c. Specify a limit or a certain distribution on the
       number of terms that can have the same frequency
       of use, over the term-rank space {1,...,D}. By
       altering the value or distribution, the
       P(TXT(i,j) = v), 0 ≤ v ≤ (f(i), f(j))_min, and
       P(γ = z), γ_LB ≤ z ≤ γ_UB, probability distributions
       are changed, and consequently the quantity output for
       any one inquiry will also be modified. The impact can
       be estimated by Rq, because it is a function of the
       various γs related to the terms in the inquiries.
(4) Inquiry Processing Effort. Given a specifiable file structure
    and an elapsed time distribution for term lookups, the
    number of iterations involved in the determination of Rq
    can be used to estimate the average amount of time to
    process an inquiry. This information could be used by a DRS
    manager or designer to estimate certain resource requirements
    necessary to satisfy existing or projected user demands.
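
The tuning idea in item (1) can be illustrated for the smallest case,
a two-term inquiry. The sketch below is not the Rq iterative procedure
of Section 5.4.1, which is not reproduced here. It only combines the
co-occurrence estimate TXT(i,j) = γ·f(i)·f(j)/D with
inclusion-exclusion. The function names, the clipping of the estimate
to the rarer term's frequency, and the sample values are assumptions
made for illustration.

```python
def expected_and(gamma: float, f_i: int, f_j: int, corpus_size: int) -> float:
    """Expected output of 'i AND j', taken as the predicted TXT(i, j)."""
    estimate = gamma * f_i * f_j / corpus_size
    # Assumption: a conjunction cannot exceed the rarer term's frequency.
    return min(estimate, float(min(f_i, f_j)))


def expected_or(gamma: float, f_i: int, f_j: int, corpus_size: int) -> float:
    """Expected output of 'i OR j' by inclusion-exclusion."""
    return f_i + f_j - expected_and(gamma, f_i, f_j, corpus_size)


if __name__ == "__main__":
    # Hypothetical inquiry terms: f(i) = 40, f(j) = 25, D = 1000, gamma = 2.
    print("AND:", expected_and(2.0, 40, 25, 1000))
    print("OR: ", expected_or(2.0, 40, 25, 1000))
```

Comparing the two figures before submitting an inquiry shows how far
replacing a conjunction with a disjunction would enlarge the expected
output.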

The above exemplary management applications of the Rq measure
can also be viewed in the context of a design process. Combining
these applications with certain canonical expressions, noted in
Chapters 4 and 5, that characterize the fundamental relationships in
DRSs, one can construct a hypothetic sequence of steps which
illustrates their use in the design process. Further, this procedure
can be considered as a basis for a simulation model that would enable
a designer to experiment with different parameter values and variable
limits, prior to the construction of the DRS. The steps envisaged
are as follows:
(1) Selection of Corpus Topic

    a. Analysis of user needs

    b. Selection of the published subject area of interest;
       for example, the field of Operations Research.

(2) Identification of Periodical Population and Determination
    of Periodical Productivity Distribution

    a. Determination of the tradeoff between number of periodicals
       to be collected versus the percent of the relevant
       literature covered, by applying Bradford's Law of Scatter
       (88). Kendall (75) has in fact investigated the periodical
       productivity distribution for Operations Research
       and found that if one collected the five most productive
       journals, 33 percent of the new articles (documents)
       would be captured, or the eighteen most productive
       journals, 50 percent of the new articles would be captured,
       or the 67 most productive journals would yield 75 percent
       of the new articles, etc.

    b. Estimation of the expected growth rate of the literature
       in the field, and conversely, the death or deletion rate.
       In most cases a simple exponential form as in Fig. 1.5 can
       be utilized.

(3) Estimation of the Corpus Size D

    a. From the determination of the required number of
       periodicals to be collected, an estimate of the initial
       corpus size, D, can be made.

(4) Selection of Candidate Term Frequency of Use Distributions

    a. The most convenient relationship to employ is the MEZ
       canonical form, with the parameters K, B and α determined
       as in Section 5.3, that is compatible with a corpus of
       size D and selected average depth of indexing (e.g.,
       15 terms per document).

(5) Determination of the Probability of Term Co-occurrence

    a. As a function of the term frequencies of use (f(i)), the
       size of the corpus (D), and the distribution of the number
       of terms with the same frequency of use (estimated as*
       in Section 5.3.2), the probability of two terms with
       frequencies of use f(i), f(j) co-occurring can be
       determined, as discussed in Section 5.5.

(6) Derivation of the γ-Factors for Rq

    a. Based on the information determined in steps 4 and 3, the
       γ-factor distribution can be derived as shown in Section
       5.4.1.

(7) Generate Sample Inquiries

    a. A set of "typical" inquiries, from the point of view of
       form (and not content), can be constructed using
       combinations of Boolean connectors and terms with various
       frequencies of use as specified by the MEZ distribution.

(8) Estimation of Quantity Output, Rq

    a. Using the γ-factor distribution and the procedure
       developed in Section 5.4.1, the quantity output for the
       candidate inquiries can be predicted (for a direct match
       search strategy).

(9) Measurement of the Sensitivity of Rq to:

    a. Changes in the corpus and thesaurus size

    b. Changes in the MEZ parameters

    c. Changes in the distribution of the number of terms with
       the same frequency of use

    d. Changes in search strategy

*An alternative approach is to employ the Waring distribution;
see Herdan (64, 65) and Jones (73) for a discussion of this
distribution.
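
Step (2a) amounts to a lookup against a cumulative coverage table. The
sketch below encodes only the three Kendall figures quoted in that
step (5, 18 and 67 journals for 33, 50 and 75 percent of new
articles). The function names are hypothetical, and no interpolation
between or beyond the quoted points is attempted.

```python
# (journals collected, percent of new articles captured), as quoted in step (2a)
KENDALL_COVERAGE = [(5, 33), (18, 50), (67, 75)]


def journals_needed(target_percent, table=KENDALL_COVERAGE):
    """Smallest quoted journal count whose coverage meets the target, else None."""
    for journals, percent in table:
        if percent >= target_percent:
            return journals
    return None  # the target lies beyond the quoted figures


def count_ratios(table=KENDALL_COVERAGE):
    """Ratios of successive cumulative journal counts."""
    counts = [journals for journals, _ in table]
    return [round(b / a, 2) for a, b in zip(counts, counts[1:])]


if __name__ == "__main__":
    print(journals_needed(50))
    print(journals_needed(90))
    print(count_ratios())
```

The roughly constant ratio between successive journal counts reflects
the rapidly diminishing return per added journal that Bradford's Law
describes.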

The standard process of designing DRSs is considerably more art
than science, with many system variables and relationships at best
indirectly controlled or left to assume "natural" values by implicit
default options. This process can be improved by simply taking
advantage of the statistical regularities that characterize the
relationship among DRS parameters. The hypothetic design sequence
described above is one way in which the design process can be made
more formal and accurate. Also, it provides a basis for a structure
within which a designer can exploit the various canonical forms that
characterize the statistical stability of various DRS properties.
Chapter 7

RECOMMENDATIONS FOR ADDITIONAL RESEARCH

7.1 INTRODUCTION

There are a number of directions for future research in the area
of analytic/simulation modeling of Document Storage and Retrieval
Systems. Several suggestions are briefly noted in this chapter in
the hope that they will provide a point of departure for one or more
subsequent research efforts.

7.2 CORPUS HOMOGENEITY AND HETEROGENEITY

The DRSs investigated in this study are basically homogeneous in
subject content; that is to say, the corpus is dedicated to a single
subject. The ILR DRS has a homogeneous corpus and the subject is
Information Science. A measure to distinguish between a homogeneous
and heterogeneous corpus has yet to be developed. Also, a means of
measuring the impact of more or less heterogeneity on DRS performance
is needed.

Presumably, a measure could be based in part on the characteristics
of the DXD matrix, which is defined by the operation

$$\mathrm{DXD} = (\mathrm{DXT})(\mathrm{DXT})^{T}$$

The DXD matrix gives the document-document association profiles, and
presumably in a homogeneous corpus the majority of documents would be
highly associated. The converse would hold for a heterogeneous
corpus.
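
To make the DXD operation concrete, the sketch below forms the product
of a binary document-by-term matrix with its transpose. It then
computes the share of document pairs that have at least one descriptor
in common. That indicator is hypothetical, offered only as one
possible starting point and not as a measure proposed in this report.
The function names and the small matrix are likewise illustrative.

```python
def dxd_matrix(dxt):
    """Return DXD, the product of a binary document-by-term matrix and its transpose."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in dxt]
            for row_i in dxt]


def linked_pair_fraction(dxt):
    """Hypothetical indicator: share of document pairs sharing at least one term."""
    dxd = dxd_matrix(dxt)
    n = len(dxd)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(1 for i, j in pairs if dxd[i][j] > 0) / len(pairs)


if __name__ == "__main__":
    # Three documents indexed over three terms (illustrative data only).
    dxt = [[1, 1, 0],
           [1, 0, 1],
           [0, 0, 1]]
    print(dxd_matrix(dxt))
    print(linked_pair_fraction(dxt))
```

Each diagonal entry of DXD is the depth of indexing of one document,
and each off-diagonal entry is the number of descriptors two documents
share.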
7.3 DISTRIBUTION OF TERMS WITH COMMON FREQUENCIES OF USE

Little, if any, control is ever exercised over the number of
terms allowed to have the same frequency of occurrence, Jx. From the
MEZ relationship, the Waring distribution (see Herdan (64, 65)) and
Zipf's two "Laws" (see Booth (10)), there is an implied increase in
Jx as the rank of the term decreases. This simply means that there
will be more terms that are used infrequently than there are terms
that are used frequently. The issue of interest is, what should Jx
be for a specified term rank and for certain system characteristics,
D and T, and what is the impact of Jx on DRS performance.

It is clear that Jx has a marked impact on the probability of
co-occurrence of terms with frequencies of use f(i), f(j). This is
illustrated in Fig. 5.43, in which the theoretical lower bound of the
actual P(TXT(i,j)/f(i),f(j) > 0) is plotted for f(i) = f(j) = 1 and
1 ≤ Jx < D. The various formulae presented in Sec. 5.5 provide a
point of departure for any additional computations of P(TXT(i,j) =
S) for a specific f(i), f(j) and Jx.

7.4 THE MEZ CANONICAL FORM 

Mandelbrot (94, 95, 96), Herdan (64, 65), Zipf (153), and
Krevitt (80) have investigated various term usage relationships,
primarily in a text-free setting. For that unconstrained setting, the
MEZ exponent α is considered always to be in the range of 1 ≤ α ≤
1.6. However, the system vocabularies of DRSs are very constrained (in
the predicate calculus sense), and for the test system a very good
fit between the MEZ and the term frequency of use versus rank curve
was possible with α = 0.9. Clearly if one were to reduce α to zero,
the frequency of use versus rank distribution would yield a uniform
distribution. Intuitively then, as one reduces α one constrains the
"richness" of the vocabulary. Notably, Mandelbrot (94) has observed
that in children's talk (an example of constrained vocabularies of a
different type) it is possible for α ≤ 1. The issues of interest
are: What should α be in order that the DRS perform well, and how
can one best adjust the DRS to move toward a more preferred term
frequency of use situation? And, as the DRSs grow over time, what
changes can be expected in the parameters K, B and α?
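
The effect of the exponent can be seen by evaluating the MEZ form
f(r) = K(r+B)^(-α) for several values of α. The sketch below reports
the share of all term usage accounted for by the top-ranked tenth of
the terms. The function names, the choice B = 1 and the thesaurus size
of 1,000 terms are hypothetical. K cancels out of the share.

```python
def mez_frequency(rank: int, k: float, b: float, alpha: float) -> float:
    """MEZ canonical form: f(r) = K * (r + B) ** (-alpha)."""
    return k * (rank + b) ** (-alpha)


def top_share(num_terms: int, alpha: float, b: float = 1.0,
              top_fraction: float = 0.1) -> float:
    """Share of total usage carried by the top-ranked fraction of terms."""
    freqs = [mez_frequency(r, 1.0, b, alpha) for r in range(1, num_terms + 1)]
    top_n = max(1, int(num_terms * top_fraction))
    return sum(freqs[:top_n]) / sum(freqs)


if __name__ == "__main__":
    for alpha in (1.6, 1.0, 0.9, 0.0):
        print(alpha, round(top_share(1000, alpha), 3))
```

At α = 0 every term has the same frequency, so the top tenth of the
terms carries exactly one tenth of the usage. Larger exponents
concentrate usage in the highest-ranked terms.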

7.5 DEPTH OF INDEXING DISTRIBUTION 

The depth of indexing distribution portrays the frequency
distribution of the assignment of terms to documents. Of the systems*
on which empirical data was available, the basic form of the
distributions is very similar; in fact, sufficiently similar for one
to suspect that a canonical form should exist. On the basis of a crude
fit, a candidate is the Beta distribution:

$$f(w, x, \phi) = \frac{(x+\phi+1)!}{x!\,\phi!}\, w^{x} (1-w)^{\phi}$$

where w is the normalized depth of indexing level defined over the
finite interval 0 < w < 1, and x and φ are constants. Wiederkehr
(143) has developed certain forms for a modified Beta distribution in
his discussion of search characteristic curves. Also, Bourne (13),
Svenonius (127), Swanson (12?) and Zunde (151) have explored various
aspects of the depth of indexing distribution. However, no general
formulation of the expected or likely depth of indexing distribution
has been developed, and just as importantly there is no established
means of linking the depth of indexing characteristics with the term
frequency of use distribution, and the DRS performance.

*The ILR test system and the systems investigated by Litofsky
(90).

7.6 HIGHER ORDER TERM ASSOCIATIONS 

The vast majority of discussions (this paper included) dealing
with term-term associations just employ the first order TXT matrix
relationships. As noted in Chapters 4 and 5, the elements TXT(i,j)
provide the degree of association between terms i and j, which is
also the first order of association. To obtain the higher order
associations between two terms, one merely takes the appropriate
power of the TXT matrix. That is, (TXT)^n yields the nth order
association between the terms in the thesaurus. Salton (117) has
suggested a scheme to utilize the higher order associations for
expanding an initial inquiry. The procedure entails a weighting
factor α, where 0 < α < 1, which causes α^n to be a monotonically
decreasing function as n increases. This condition implicitly states
that the lower order associations are more important than the higher
order associations. Employing Salton's notion of a normalized query
vector, Q, one then gets the following relationship between an
expanded query Q', and the original query Q:

$$Q' = Q\left[I + \{\alpha(\mathrm{TXT})\}^{1} + \{\alpha(\mathrm{TXT})\}^{2} + \dots + \{\alpha(\mathrm{TXT})\}^{n}\right]$$

Given that this type of relationship is valid, what are the
reasonable values of α and n, and what are their effects on the
performance of the DRS?
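
A minimal sketch of this kind of expansion follows. It forms the sum
of powers of the weighted association matrix, starting from the
identity, and applies it to a query row vector. The function names,
the two-term association matrix (taken here as already normalized)
and the values α = 0.5 and n = 2 are hypothetical.

```python
def mat_mul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def expand_query(query, txt, alpha, order):
    """Return Q times [I + (alpha*TXT) + (alpha*TXT)^2 + ... + (alpha*TXT)^order]."""
    n = len(txt)
    scaled = [[alpha * value for value in row] for row in txt]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in power]  # starts as the identity matrix
    for _ in range(order):
        power = mat_mul(power, scaled)  # next power of alpha*TXT
        total = [[t + p for t, p in zip(t_row, p_row)]
                 for t_row, p_row in zip(total, power)]
    return [sum(query[i] * total[i][j] for i in range(n)) for j in range(n)]


if __name__ == "__main__":
    txt = [[0.0, 1.0],
           [1.0, 0.0]]   # illustrative normalized association matrix
    print(expand_query([1.0, 0.0], txt, alpha=0.5, order=2))
```

With raw co-occurrence counts in place of normalized associations the
successive powers can grow instead of shrinking, which is one reason
the choice of α and n matters.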

7.7 Rq MODEL EXTENSIONS

Given the basic construct of the Rq model, it is of interest
to consider how the model can be extended to deal in some way with
the issue of relevance.

The most logical step is to employ some means of ranking the
documents by degree of inquiry term/document descriptor overlap or
associative thresholds, or by the weak ordering action suggested by
Cooper (35). The important procedure is to link the output set
with a relevance measure, which in this case would be system defined
(as opposed to user judgment). Obviously, the simplest case is for
a direct match search strategy in which the documents retrieved that
satisfy any explicit or implied conjunction combination of terms in
the inquiry would be judged the most likely relevant subset, and the
documents generated by the disjunctive arguments in the inquiry less
likely to be relevant. The analogous argument would hold for a word
association search strategy. This elementary ranking of the output
set would yield at best a binary relevance mapping on Rq, which is
less discriminating than desired.

A more sophisticated approach would be to employ a probabilistic
mechanism in the DXT matrix that would reflect both the fundamental
indefiniteness* in the indexing term selection process, and the
strength of the term-document assignment. Thus, given a term-document
relevance "weighting," one could introduce relevance thresholds in the
Rq iterative procedure, and potentially rank the output set. The
probabilistic structures put forth by Maron and Kuhns (97) and Bryant
(23) appear to be most appropriate.

7.7.1 Psychological Analogies**

A rather innovative extension of the model structure is to attempt
to characterize the conceptual "dual" or analogous psychological
process experienced by humans in searching for or processing
information, by a similar model construction. That is to say, there
are certain regularities that characterize Document Retrieval Systems,
and it is of interest to know whether there are analogous
regularities that characterize the human thought process of
information storage and retrieval, and, in particular, indexing and
abstracting processes.

There appears to be a sound, though largely unexploited, logical
basis upon which to investigate the above notion. For example, the
MEZ relationship is known to characterize the word frequency of
occurrence and rank distribution of a variety of languages. In fact,
Mandelbrot (94, 95, 96) (see also Brillouin (18)) derived that
relationship employing the notion of the "cost" of a word as the
indicator of its likelihood of use. The hypothesis is that the less
costly words are used more often than the more costly, where cost is a
surrogate for "effort" to use. Also, Zipf (153) presented the "law" of
term frequency of use versus rank within the context of his theory on
Human Behavior and the Principle of Least Effort (153). An attempt
was made by Rosenberg (115) to utilize the Zipf relationship for
predicting index term selection for use, but the performance of that
model clearly needs to be improved before an operational construct
can be developed. It would seem that a weighted Bayesian or
conditioned probability structure is needed to accommodate the many
degrees of semantic uncertainty and noise embedded in document
discussions, human communication and indexing.

*This indefiniteness arises more from a type of intrinsic
uncertainty or ambiguity than from statistical variation, a sort of
"fuzzy" membership of a term to a document descriptor set (see Zadeh
(151) for a fuller discussion).

**Suggested by Professor F. N. Nicosia, Graduate School of Business
Administration, University of California, Berkeley.
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GLOSSARY 



Boolean Algebra — a Boolean Algebra is defined as a distributive lat- 
tice in which each element "a" has a complement defined by its 
negation. This structure, for a defined set T and its elements 
(A,B,...), is defined in terms of the following operations. 
Conjunction; C = A*B, the subset of subclass of all iridex terms 
or elements of T that are both in the subsets of A and B. Dis- 
junction; D = A+B, the subset of all index terms Or elements of 
T which are either in subset A or isubset B. Negation; N = -B 
or B, the subset of all index terms in T which are not in subset 
B. 

Bradfords Law of Literary Yield or Scatter ~ if periodicals are 
ranked into N groups, each yielding the sam^ number of articles 
as a specified topic, the nurnber of periodicals in each group 
will increase geometrically, as per: Irnrn^. 

Coordinate Index — an index system in which the descriptor terms 
are manipulated. There are two classes of coordinate index 
systems : 

a) Pre-coordinate — those DRSs in which the coordination of 
the descriptors takes place during the inquiry generation 
process. 

b) Post-coordinate — those DRSs in which the coordination of 
the descriptors takes place during the inquiry generation 
process. 

Document — any discrete unit of information — articles, reports, 
recordings, etc. 

Document Retrieval Systems — a class of information retrieval systems 
solely concerned with the subject analysis of document content, 
the storage of a set of official surrogates "defining" document 
content, and the "mechanical" search of the surrogate set to 
identify or select those documents most "relevant" to a user's 
formal request. 

Facet Index — a composite index of an item by combining in a pre- 
scribed manner the terms derived from separate relational index- 
ing examinations. 

Indexing — the process in which documents are analyzed, and terms 
Indicating subject content are assigned or derived. 

Mandelbrot-Estoup-Zipf Relationship — the term frequency of use f(r)
versus rank (r) distribution in a language is a decreasing convex
function in log-log space, and is of the form:

f(r) = K (r + B)^{-\alpha}
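
A small sketch of this rank-frequency form may help. The exponent symbol is not legible in the source, so the parameter name `alpha` is an assumption, as are the function names and the sample constants:

```python
def mandelbrot_frequency(r, K, B, alpha):
    """Frequency of use at rank r under f(r) = K * (r + B) ** (-alpha)."""
    return K * (r + B) ** (-alpha)

def zipf_frequency(r, K):
    """Special case with B = 0 and alpha = 1, i.e. f(r) = K / r."""
    return mandelbrot_frequency(r, K, B=0.0, alpha=1.0)

# Hypothetical constants, chosen only to show the shape of the curve.
curve = [mandelbrot_frequency(r, K=100.0, B=1.0, alpha=1.0) for r in range(1, 6)]
```

With positive K and alpha the values fall as rank increases, which is the decreasing behaviour the definition describes.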

Uniterms, Keywords, Descriptors — words or word-pairs extracted from
a document that are used to identify the subject content of the
document.

Word Relationships — there are four operational word relationship
categories that can be employed in DRSs.

(1) Semantic relationships which manifest the meaning and con- 
text of terms within a language, 

(2) Syntactic relationships which arise from terms as members
of word classes and with the class relationships in a
structural (grammatical) sense,

(3) Syndetic relationships which measure the manner by which
words that are conjunctively coordinated with a given
or base term cross-reference one another, and

(4) Statistical relationships which measure the frequency of 
occurrence of terms in a document. 

Zipf "Law" of Term Usage — the relationship between the frequency of
use f(r) of a term and its rank (r) in a language based on
Zipf's Principle of Least Effort, and is of the form:

f(r) = K r^{-1}
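
One way to check how closely a set of term frequencies follows this form is a straight-line fit in log-log space. The sketch below is an ordinary least-squares fit written for illustration; it is not a procedure taken from the report, and the function name and sample frequencies are hypothetical:

```python
import math

def fit_power_law(freqs):
    """Fit f(r) ~ K * r**(-exponent) to frequencies listed in rank order.

    Ranks start at 1; zero frequencies are skipped because log(0) is undefined.
    Returns (K, exponent).
    """
    pts = [(math.log(r), math.log(f)) for r, f in enumerate(freqs, start=1) if f > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Hypothetical frequencies that follow 120 / r exactly.
sample = [120.0, 60.0, 40.0, 30.0, 24.0]
K, exponent = fit_power_law(sample)
```

For this exact sample the fit returns K close to 120 and an exponent close to 1; real term counts would scatter around the line.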

INSTITUTE OF LIBRARY RESEARCH - TEST SYSTEM CHARACTERISTICS
o Thesaurus Listing (Sample)
o Document Descriptor Listing (Sample)

SUBJECT AUTHORITY LIST (98)

Abbreviations

S  = SEE
SA = SEE ALSO
SN = IN THE SENSE OF (I.E. SCOPE NOTE)
♦  = NO DOCUMENTS YET INDEXED WITH THIS TERM
♦  = TERM NOT ALLOWED - RELATED TERM TO BE USED



♦ABBREVIATION
ABSTRACT 
ABSTRACTING 
ACCESS 

ACCESSION NUMBER 
ACCURACY 
ACQUISITION 
ADDRESS 

ADMINISTRATION 
ALGEBRA

♦ ALGOL 

S PROG. LANGUAGE 
ALGORITHM 
ALPHABETIC 
ALPHABETIC ORDER 
ALPHANUMERIC 

* ALTERNATIVES 
AMBIGUITY 
ANALOGY 
ANALYSIS 
ANSWER 

♦ANTHOLOGY 

SA BIBLIOGRAPHY 
APPLICATION 
♦ARITHMETIC 

S MATHEMATICS 

ARRAY 
♦ARTICLE 

S DOCUMENT 
ARTIFICIAL INTEL 
ASSIGNED 
ASSOCIATION 
ASSOCIATIVE 



♦ATTRIBUTE 

S CHARACTERISTIC 
AUTHOR 

AUTHORITY LIST 

SA THESAURUS 

AUTO ABSTRACTING

AUTO. INDEXING 

AUTOMATIC 

AUTOMATION 

SA MECHANIZATION 



BATCH PROCESSING 

BIBLIOGRAPHIC 

BIBLIOGRAPHY 

SA ANTHOLOGY 
BINARY 

BOOK

BOOLEAN 

SA LOGICAL 



CALL NUMBER 
CANONICAL 

SA NORMALIZED 

CARD 

CARD CATALOG 

CATALOG 

CATALOGING 

CATEGORIES 

CENTERS 

CENTRALIZED 

CHARACTERISTIC 
CHEMICAL
CIRCULATION
CITATION
CITATION INDEX
♦CLAIM
SA COPYRIGHT
SA PATENT
CLASSIF. SCHEMA
CLASSIFICATION
CLERICAL
♦CLUE WORD
S KEYWORD
CLUMP
CLUSTER
CO-OCCURRENCE
♦COBOL
S PROG. LANGUAGE
CODE
SN f'DlA DESIGNATION
CODING
SN COMPUTER CODING
COEFFICIENT
COLLECTION
♦COLLOQUIUM
SA CONFERENCE
SA MEETING
SA SYMPOSIUM
COMBINATIONS
♦COMIT
S PROG. LANGUAGE
COMMUNICATION
COMP LINGUISTICS
COMPARISON
COMPUTER
CONCEPT
CONCORDANCE
CONDITIONAL PROB
CONFERENCE
SA COLLOQUIUM
SA MEETING
SA SYMPOSIUM
CONNECTION
♦CONSECUTIVE
S ORDER
♦CONSOLE
S REMOTE TERMINAL
CONTENT
CONTENT ANALYSIS
CONTEXT
CONTROL
CONTROLLED
CONVENTIONAL
CONVERSION
COORDINATE
COORDINATE INDEX
SA UNITERM SYSTEM
* COPYRIGHT
SA CLAIM
SA PATENT
♦CORE
S STORAGE
CORRELATION
COST
COUNT
COUPLING
CRANFIELD
CRITERIA
CRITICAL
SN REVIEWING, NOT VITAL
CROSS REFERENCE
CURRENT AWARENES
CURRICULUM
♦CUSTOMER
S USER

DATA
♦DECENTRALIZATION
DECISION THEORY
DEDUCTIVE
DEGREE
DEPTH OF INDEXIN
DESCRIPTIVE
DESCRIPTOR
SA KEYWORD
SA TAG
SA TERM
DESIGN
SA PLANNING
DICTIONARY
♦DIFFERENCE
S COMPARISON
♦DIGITAL COMPUTER
S COMPUTER
DISCRIMINANT
♦DISPLAY
S REMOTE TERMINAL
DISSEMINATION
♦DISSERTATION
DOCUMENT
SA JOURNAL
DOCUMENTATION
DUAL DICTIONARY

♦ECONOMICS
S COST
EDITING
EDUCATION
EFFECTIVENESS
SA EFFICIENCY
EFFICIENCY
SA EFFECTIVENESS
♦ELECTRONIC COMPUTER
S COMPUTER
♦EMPIRICAL
S EXPERIMENT
♦ENCODING
S CODING
ENTROPY
ENTRY
SN ACCESS POINT
ERROR
EVALUATION
SA TEST
SA UTILITY
SA VALUE
EXPERIMENT
EXTRACT

FACET
FACETED CLASSIF.
FACT RETRIEVAL
♦FACTOR ANALYSIS
S STAT. METHOD
FALSE DROP
FEEDBACK
FILE
SA LIST
SA STRING
FILE ORGANIZATIO
FLOW OF INFO.
FORMAT
♦FORTRAN
S PROG. LANGUAGE
FREQUENCY
FUNCTION
SN OPERATIONAL, NOT MATHEMATICAL

GENERAL
GENERATION
SN PRODUCTION
GENERIC
♦GOAL
S OBJECTIVE
GOVERNMENT
GRAMMAR
GRAPH
SN MATHEMATICAL GRAPH
SA TABLE
GRAPHICS
SN GRAPHIC MATERIAL, E.G. PHOTOS.
♦GROUP
S CLUMP

HARDWARE
SN COMPUTERS, MICROFILM EQUIPMENT, ETC.
SA MECHANICAL
♦HEADINGS
S SUBJECT HEADING
HIERARCHY
HISTORICAL
♦HUMAN
S MANUAL
♦HUMAN INDEXING
S MANUAL INDEXING

♦IDENTICAL
IDENTIFICATION
ILLUSTRATION
♦IMPLEMENTATION
INDEPENDENT
INDEX
INDEXING
INFERENCE
INFO. RETRIEVAL
INFO. SCIENCE
INFORMATION
INPUT
♦INQUIRER
S USER
♦INQUIRY
S QUESTION
♦INSTRUCTION
S EDUCATION
INTELLECTUAL
INTERDISCIPLINAR
INTERFACE
INTERPRET
♦INTERROGATE
S QUESTION
♦INTERSECTION
S VENN DIAGRAM
INTRODUCTORY
INTUITIVE
INVENTORY
* INVERTED
IRRELEVANT
♦ ITEM
S DOCUMENT
ITERATIVE
SA RECURSIVE
JOURNAL
SA DOCUMENT

KEYPUNCH
KEYWORD
SA DESCRIPTOR
SA TAG
SA TERM
KWIC

LANGUAGE
LARGE
LATTICE
LAW
♦LEVEL
S DEGREE
♦LEXICAL
S ALPHABETIC
♦LEXICON
S DICTIONARY
LIBRARIAN
LIBRARY
LINGUISTIC
LINK
LIST
SA FILE
SA STRING
LITERATURE
LOGIC
LOGICAL
SA BOOLEAN

♦MACHINE
S HARDWARE
MACHINE-READABLE
♦ MAGNETIC TAPE
S STORAGE
MAN-MACHINE
MANUAL
MANUAL INDEXING
MATCH
MATHEMATICAL
MATHEMATICS
SA PROBABILITY
MATRIX
MEANING
MEASURE
MECHANICAL
SA HARDWARE
MECHANIZATION
SA AUTOMATION
MEDIUM
MEETING
SA COLLOQUIUM
SA CONFERENCE
SA SYMPOSIUM
♦MEMORY
S STORAGE
METHODOLOGY
♦ METRIC
S MEASURE
MICROFICHE
MICROFILM
MODEL
SA SIMULATION
MODIFICATION
MULTIPLE

NATIONAL
NATURAL
NATURAL LANGUAGE
NEEDS
NETWORK
SN ORGANIZATIONAL STRUCTURE
SA ORGANIZATION
NOISE
♦NOMENCLATURE
S NOTATION
NON-CONVENTIONAL
NON-DISCRIMINANT
NON-FILE
NON-RANDOM
NON-RELEVANT
♦ NORMALIZED
SA CANONICAL
NOTATION
SA TERMINOLOGY
NUMBER
NUMERIC

OBJECTIVE
SN GOAL, NOT AS OPPOSED TO SUBJECTIVE
♦OCCURRENCE
OFF-LINE
ON-LINE
OPERATION
OPTIMIZATION
ORDER
ORGANIZATION
SA NETWORK
OUTPUT

♦ PAIR
S WORD ASSOCIATION
S DOCUMENT
PARAMETER
SA VARIABLE
PARSE
PATENT
SA CLAIM
SA COPYRIGHT
PATTERN
PERFORMANCE
♦PERIODICAL
S JOURNAL
PERMUTED
PERTINENT
SA RELEVANT
PHILOSOPHY
SA POLICY
♦ PHOTO
S GRAPHICS
PLANNING
SA DESIGN
♦PLOT
S GRAPH
♦POLICY
SA PHILOSOPHY
♦ POPULATION
S COLLECTION
PRECISION
PREDICTION
♦PRINCIPLE
♦ PRINT-OUT
S OUTPUT
PRINTING
♦PRIVACY
S SECRECY
PROBABILITY
SA MATHEMATICS
PROCEDURE
PROCEEDINGS
PROCESSING
PROFILE
PROG. LANGUAGE
PROGRAM
SN COMPUTER PROGRAM
SA ROUTINE
SA SOFTWARE
SA SUBROUTINE
PROGRAMMED
♦PROPERTY
S CHARACTERISTIC
PSYCHOLOGY
♦PUBLICATION
S DOCUMENT
PUNCHED
♦PUNCHED-CARD
S STORAGE
PUNCTUATION
♦PURPOSE
S OBJECTIVE

QUALITATIVE
SA SUBJECTIVE
QUANTITATIVE
♦QUERY
S QUESTION
QUESTION
SN BOTH NOUN AND VERB
QUESTION-ANSWER

RANDOM
RANDOM-ACCESS
RANK
READING
REAL-TIME
RECALL
RECOGNITION
RECORD
♦RECORDED INFO.
S RECORD
RECURSIVE
SA ITERATIVE
REFERENCE
♦REJECTION
RELATED
RELATIONSHIP
RELATIVE
RELEVANCE
RELEVANT
SA PERTINENT
♦REMOTE TELETYPES
S REMOTE TERMINAL
REMOTE TERMINAL
SA VISUAL DIS. CON.
♦REPORT
S DOCUMENT
♦REQUEST
S QUESTION
RESEARCH
♦RESPONSE
S ANSWER
RESPONSE TIME
RETRIEVAL
RETRIEVAL SYSTEM
REVIEW
SA SUMMARY
SA SURVEY
ROLE
ROUTINE
SN COMPUTER ROUTINE
SA PROGRAM
SA SOFTWARE
SA SUBROUTINE

SAMPLE
SCANNING
SCIENTIFIC
SCOPE NOTE
SEARCH CRITERIA
SEARCH STRATEGY
SEARCHING
♦ SECRECY
SEE ALSO
SN AS USED IN CATALOGING
SEE-REFERENCE
SELECTION
SELECTIVE DISSEM
SEMANTIC
SA SYNTAX
SEQUENCE
♦ SERIAL
S JOURNAL
SERVICE
SET THEORY
SETS
SHELFLIST
SIGNIFICANCE
SIMULATION
SA MODEL
SIZE
SMALL
SOCIAL IMPLIC.
SOFTWARE
SA PROGRAM
SA ROUTINE
SA SUBROUTINE
SORTING
SOURCE
SPECIALIZED
SPECIFICITY
STANDARDIZATION
STAT ASSOCIATION
STAT. ANALYSIS
SA STAT. METHOD
STAT. METHOD
SA STAT. ANALYSIS
STATE-OF-THE-ART
STATISTICAL
♦ STOCHASTIC
S RANDOM
STORAGE
STRING
SA FILE
SA LIST
STRUCTURE
SUBJECT
SUBJECT HEADING
SUBJECT INDEXING
SUBJECT-CATALOG.
♦ SUBJECTIVE
SA QUALITATIVE
SUBROUTINE
SA PROGRAM
SA ROUTINE
SA SOFTWARE
SUMMARY
SA REVIEW
SA SURVEY
SURVEY
SA REVIEW
SA SUMMARY
SYMBOL
SYMBOLIC LOGIC
SYMPOSIUM
SA COLLOQUIUM
SA CONFERENCE
SA MEETING
SYNONYM
SYNTACTIC ANAL.
SYNTAX
SA SEMANTIC
SYSTEM

TABLE
SA GRAPH
TAG
SA DESCRIPTOR
SA KEYWORD
SA TERM
♦TAPE
S STORAGE
♦TEACHING
S EDUCATION
TECHNICAL
TECHNICAL REPORT
TECHNOLOGY
TELEGRAPHIC ABS.
TERM
SA DESCRIPTOR
SA KEYWORD
SA TAG
♦TERMINAL
S REMOTE TERMINAL
TERMINOLOGY
SA NOTATION
TEST
SA EVALUATION
SA UTILITY
SA VALUE
TEXT
THEORY
THESAURUS
SA AUTHORITY LIST
TIME
TIME-SHARING
TITLE
♦TOPIC
S SUBJECT
TRANSFORMATION
TRANSLATION
♦TRANSLITERATION
TRANSMISSION
TREE
TREE STRUCTURE
TRUNCATION
♦TYPE STYLE
TYPE-SETTING
♦TYPOGRAPHICAL

MfAWJ INDEXING
WORD
WORD ASSOCIATION
WORD FREQUENCY
♦WORD PAIRS
S WORD ASSOCIATION

♦UNION
SN SET THEORY UNION
S VENN DIAGRAM
♦UNION CATALOG
♦UNITERM
S DESCRIPTOR
UNITERM SYSTEM
SA COORDINATE INDEX
UPDATING
USER
UTILITY
SA EVALUATION
SA TEST
SA VALUE

VALIDATION
VALUE
SA EVALUATION
SA TEST
SA UTILITY
VARIABLE
SA PARAMETER
VECTOR
VENN DIAGRAM
♦ VISUAL DIS. CON.
SA REMOTE TERMINAL
VOCABULARY

WEIGHT

DOCUMENT DESCRIPTOR LISTING (98)

A013101L0ACCESS
A013102L0DATA
A013103L0LIST
A013104L0PROG. LANGUAGE
A013105L0STRING
A013106L0VARIABLE

A013201L0ACCESS
A013202L0CONTEXT
A013203L0GRAMMAR
A013204L0NATURAL LANGUAGE
A013205L0RELEVANT
A013206L0SYNTACTIC ANAL.
A013207L0TRANSFORMATION

A013301L0ABSTRACTING
A013302L0CONFERENCE
A013303L0LINGUISTIC
A013304L0PARSE
A013305L0SYMBOLIC LOGIC

A013401L0ALGORITHM
A013402L0INTERPRET
A013403L0NOISE
A013404L0REDUNDANCY
A013405L0SYSTEM

A013501L0ACCESS
A013502L0DOCUMENT
A013503L0LIBRARIAN
A013504L0RESEARCH
A013505L0TECHNOLOGY

A013601L0ACQUISITION
A013602L0LIBRARY
A013603L0RETRIEVAL

A013701L0ACCESSION NUMBER
A013702L0RETRIEVAL

B001201L0AUTO ABSTRACTING
B001202L0LINGUISTIC
B001203L0TRANSLATION

B001301L0ABSTRACTING
B001302L0DICTIONARY
B001303L0LIBRARY

B001401L0DOCUMENT
B001402L0SCANNING

B001501L0AUTOMATION
B001502L0INFO. RETRIEVAL
B001503L0QUESTION



ALGORITHM
FILE 

NOTATION 
PROGRAM 

STRUCTURE 



ALGORITHM 
DATA 

INFO. RETRIEVAL
OUTPUT
SEMANTIC
SYNTAX 



ASSIGNED 

INFORMATION 

OPERATION 

SETS 

SYNTAX 



COMMUNICATION 

FLOW OF INFO.

INFORMATION 

PARSE 

STORAGE 

SYSTEM 



COST 

LANGUAGE 
PROCEDURE 
STORAGE 
SYSTEM 



COMPUTER 

GENERATION 

INTERPRET 

QUESTION-ANSWER

SURVEY

TIME-SHARING



ALGORITHM 


ANALYSIS 


COMP LINGUISTICS 


EDITING 


EVALUATION 


INFO. RETRIEVAL 




LOGIC 


MATCH 


NATURAL LANGUAGE 


PROG. LANGUAGE


PROGRAM


QUESTION-ANSWER




TECHNICAL


TIME-SHARING 


TRANSLATION 




COMPUTER 


CONFERENCE 


ERROR 




MAN-MACHINE


MATHEMATICAL 


NATURAL LANGUAGE 


NOTATION 


PROG. LANGUAGE 


PROGRAM 




SEMANTIC 


SOFTWARE 


SYNTAX 




TRANSLATION 


USER 


WORD




BIBLIOGRAPHY 


CENTERS 


CIRCULATION 




FLOW OF INFO. 


GENERAL 


INFO. RETRIEVAL 




LIBRARY 


MECHANIZATION 


REMOTE TERMINAL




SCIENTIFIC 


SEARCHING 


SERVICE 




TRANSMISSION 








ANALYSIS 


CIRCULATION 


COMMUNICATION 




MEASURE


MEETING 


PATTERN 




SERVICE 


SYSTEM 






BOOK


CLASSIFICATION 


LIBRARY 




SIZE


SUBJECT 






BIBLIOGRAPHIC 


COMMUNICATION 


LANGUAGE 




NATURAL 


STORAGE 


SYSTEM 




ASSOCIATION 


CLASSIFICATION 


DATA 




FREQUENCY 


INDEX 


INFORMATION 




LITERATURE 


MICROFILM 


NETWORK 




INDEXING 


INFO. RETRIEVAL 


MICROFILM 




STORAGE 


TERM 


TRANSLATION 




COMMUNICATION 


DISSEMINATION 


DOCUMENT 




INFORMATION 


INPUT 


OUTPUT 




RETRIEVAL 


SIGNIFICANCE 


THESAURUS 

Appendix C

SAMPLE DATA BASE CHARACTERISTICS

o Term Frequency of Use Listing

o Depth of Indexing Listing

o Term-Document Matrix in Condensed
  Array Format

TERM FREQUENCY OF USE FOR SAMPLE DATA BASE

Term   Use     Term   Use     Term   Use

                 51     4      102    13
   1     0       52     3      103     3
   2     4       53     0      104    15
   3     1       54     6      105     2
   4     ?       55    20      106     1
   5     0       56     1      107     0
   6     0       57     5      108    32
   7     0       58     7      109     1
   8     2       59     8      110     0
   9     0       60     6      111     2
  10     5       61     5      112     3
  11     9       62     9      113     4
  12     2       63     5      114    10
  13     0       64     0      115     3
  14     0       65     1      116     3
  15     0       66    10      117     3
  16     2       67     1      118    26
  17     1       68     4      119    12
  18    14       69    27      120     1
  19     2       70     7      121     1
  20     0       71     0      122     3
  21     3       72     0      123     1
  22     1       73     1      124     3
  23     0       74     1      125     5
  24     0       75     5      126     8
  25    12       76     4      127     2
  26     3       77     3      128     3
  27     1       78     2      129     1
  28     1       79     3      130    11
  29     3       80     1      131     3
  30     8       81     0      132     2
  31    12       82     4      133     1
  32     1       83    10      134     4
  33     0       84     0      135     0
  34     5       85     6      136     6
  35     4       86     6      137     8
  36     1       87     1      138     0
  37     4       88     1      139     8
  38     5       89     3      140     7
  39     1       90     3      141     1
  40     0       91     1      142     0
  41     1       92     ?      143     0
  42     1       93     2      144     3
  43     3       94     2      145     0
  44     ?       95     5      146     0
  45     5       96     ?      147    15
  46     2       97     2      148    26
  47     0       98     0      149     3
  48     2       99     6      150    21
  49     2      100     1      151     4
  50     1      101     1      152    22

Term   Use     Term   Use     Term   Use

 153    14      204     4      255     1
 154     2      205     3      256     1
 155     1      206     0      257    13
 156     1      207     1      258     2
 157     1      208     1      259     5
 158     4      209     1      260     1
 159     1      210     1      261     1
 160     1      211     0      262     8
 161     1      212     4      263     0
 162     ?      213     1      264     0
 163     1      214     1      265    15
 164     4      215     1      266     2
 165     0      216     0      267    20
 166     6      217     1      268    12
 167     4      218     3      269     4
 168    13      219     3      270     8
 169     1      220     1      271     0
 170     5      221     5      272    21
 171     0      222     1      273    14
 172     2      223     5      274     0
 173     8      224     0      275     3
 174     3      225     2      276     0
 175     2      226     0      277     1
 176     3      227     2      278     2
 177     7      228     7      279     3
 178     3      229     3      280     7
 179     1      230     ?      281     0
 180     1      231     1      282     2
 181     4      232     0      283     9
 182     3      233    10      284    24
 183     1      234     5      285     1
 184     6      235     0      286     0
 185     6      236     0      287     0
 186     5      237     9      288     ?
 187    11      238     4      289     1
 188     3      239     4      290    13
 189    17      240     8      291     4
 190     2      241     2      292     3
 191     3      242     3      293     1
 192     1      243     8      294     4
 193     0      244     0      295     0
 194     4      245     1      296     1
 195     0      246     2      297     0
 196     1      247     0      298     0
 197     6      248     1      299     1
 198     1      249     3      300     1
 199     0      250    17      301     3
 200     1      251     5      302     1
 201     1      252     4      303     3
 202     7      253     1      304     0
 203     2      254     6      305     3

Term   Use     Term   Use

 306     ?      339     ?
 307     ?      340     ?
 308     ?      341     ?
 309     ?      342     ?
 310     ?      343     ?
 311     ?      344     ?
 312     ?      345     ?
 313     ?      346     ?
 314    12      347     3
 315     6      348     3
 316     4      349     0
 317     6      350     0
 318     0      351     0
 319     1      352     0
 320     0      353     0
 321     3      354     6
 322     7      355     0
 323     1      356    11
 324     3      357     4
 325     3      358     1
 326     6      359     6
 327     6      360     2
 328    22      361     2
 329     2      362     0
 330     2      363     0
 331     5      364     5
 332     1      365     7
 333     5      366     4
 334     0      367     7
 335     1      368     9
 336     5      369     3
 337     4
 338     9

DEPTH OF INDEXING DISTRIBUTION

            Depth of                Depth of
Document    Indexing    Document    Indexing

    1          27          52          37
    2          11          53          12
    3          13          54          14
    4          14          55          19
    5          10          56           3
    6          18          57           8
    7          12          58          10
    8          16          59          12
    9          12          60          10
   10          15          61          10
   11          16          62          16
   12          15          63           8
   13          26          64          17
   14          12          65          12
   15          15          66          15
   16          12          67           2
   17          11          68          15
   18          10          69          23
   19          11          70          17
   20           1          71          12
   21          19          72          10
   22          16          73          14
   23          24          74           9
   24          21          75           9
   25          15          76          14
   26          13          77          10
   27          14          78          11
   28          19          79          14
   29          13          80          15
   30          24          81          16
   31          18          82           8
   32          12          83          11
   33          17          84          11
   34          18          85          14
   35          14          86          15
   36          14          87           9
   37          25          88          21
   38           8          89          14
   39          10          90          14
   40          12          91           2
   41          14          92          14
   42           3          93          20
   43           ?          94          12
   44          17          95          19
   45          26          96          15
   46          13          97          19
   47           ?          98          12
   48          34          99          13
   49          22         100           3
   50          15         101          16
   51          10         102          13

TERM - DOCUMENT MATRIX FOR SAMPLE DATA
BASE - IN CONDENSED ARRAY FORM

XXX YYZZZ

Interpret as document XXX is assigned descriptor ZZZ.

Document   Descriptors (ZZZ, in listed order)

   1   18 29 30 78 79 86 91 ? 118 120 125 130 148 149 ? 170 177 185 191 262 267 285 290 ? 322 338 343
   2   34 57 68 88 ? 108 118 144 248 265 336
   3   ? ? 62 70 85 118 119 149 ? ? 186 314 356
   4   30 55 69 ? ? 119 148 149 166 265 290 ? ? 366
   5   34 51 108 130 144 204 227 265 280 311
   6   25 31 104 118 ? 137 140 148 153 189 198 220 233 257 267 272 284 311
   7   86 97 99 ? 118 124 150 189 267 273 328 359
   8   32 55 94 112 131 168 173 215 228 231 270 300 311 322 333 357
   9   25 48 62 85 99 189 249 252 265 311 331 360
  10   61 66 ? 115 117 126 152 186 189 205 237 268 254 291 338
  11   18 31 55 62 76 83 105 108 130 131 ? 311 356 359 365 367
  12   21 31 55 57 58 59 62 95 118 170 187 243 272 338 339
  13   2 11 19 69 93 106 108 125 130 137 152 164 177 184 194 205 237 241 273 259 296 329 359 365 366 368
  14   100 102 108 119 137 188 189 233 267 272 280 283
  15   31 49 55 119 126 139 152 176 250 253 256 269 284 328 341
  16   70 83 118 119 175 189 233 257 267 272 275 283
  17   10 38 55 108 150 170 185 189 267 272 ?
  18   18 51 52 60 177 230 ? 280 342 356
  19   18 48 69 115 152 185 189 197 237 338 339
  20   31 49 69 153 243 283 284
  21   41 54 60 108 109 124 130 150 152 212 221 246 250 252 268 284 291 312 328
  22   21 57 86 114 127 140 150 169 197 219 237 249 270 284 340 347
  23   11 18 31 55 57 58 69 104 108 118 148 152 166 168 202 223 240 242 267 272 273 310 311 321 326 339 341 344
  24   26 82 86 95 114 118 119 147 150 152 184 197 ? 228 233 257 283 284 328 356 357
  25   25 30 69 108 111 118 148 153 189 268 272 257 284 328 365
  26   25 59 130 137 147 185 189 233 254 268 272 311 339
  27   4 79 104 132 136 150 174 202 260 328 338 343 356 357
  28   10 28 66 137 152 186 187 204 265 293 314 331 338
  29   4 66 69 126 158 218 219 243 269 292 328 341 357
  30   4 11 66 69 75 128 133 136 150 152 157 202 223 225 251 268 290 312 321 326 327 328 341 343
  31   4 35 46 50 108 128 132 150 172 173 191 269 270 280 284 292 333 346
  32   3 25 55 ? 104 130 147 152 173 177 196 204
  33   18 38 55 61 75 148 152 153 168 223 250 265 275 284 280 314 322
  34   18 31 61 63 66 104 108 114 147 151 177 191 238 268 279 284 311 343
  35   55 60 66 108 147 152 153 176 223 272
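
As a rough illustration of how entries of the condensed array can be read back, the sketch below parses a few entries and tallies the depth of indexing per document and the frequency of use per term. It assumes that the last three characters of the second field hold the descriptor number ZZZ and that the leading characters are a running count within the document; that layout is inferred from the listing, and the function name `parse_entry` is hypothetical. The five sample entries are the first legible entries for documents 1, 2 and 11.

```python
from collections import Counter

def parse_entry(doc_field, entry_field):
    """Split one condensed-array entry into (document, sequence, descriptor)."""
    descriptor = int(entry_field[-3:])   # ZZZ: last three characters
    sequence = int(entry_field[:-3])     # leading characters: running count (assumed)
    return int(doc_field), sequence, descriptor

# A small excerpt of the listing, not the full data base.
entries = [("1", "1 18"), ("1", "2 29"), ("2", "1 34"), ("11", "1 18"), ("11", "2 31")]
parsed = [parse_entry(doc, entry) for doc, entry in entries]

# Depth of indexing: descriptors assigned per document.
depth_of_indexing = Counter(doc for doc, _, _ in parsed)
# Term frequency of use: documents indexed by each descriptor.
term_frequency = Counter(term for _, _, term in parsed)
```

Run over the complete matrix, the two counters would correspond to the Depth of Indexing and Term Frequency of Use listings earlier in this appendix.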

Appendix D

ILLUSTRATIONS OF COMPUTATIONS TO
ESTIMATE RETRIEVAL QUANTITY

Question 1.

Form: T_1 \cdot T_2 \cdot (T_3 + T_4 + T_5) - T_6

Term Frequencies: f(T_1) = 21
                  f(T_2) = 11
                  f(T_3) = 20
                  f(T_4) = 53
                  f(T_5) = 44
                  f(T_6) = 2

f(T_1') = f(T_1 \cdot T_2) = \frac{(4.7)(21 \cdot 11)}{400} = 2.7

f(T_2') = f(T_3 + T_4) = 20 + 53 - \frac{(3.5)(20 \cdot 53)}{400}

f(T_3') = f(T_2' + T_5) = 62 + 44 - \frac{(3.75)(62 \cdot 44)}{400}

f(T_4') = f(T_1' \cdot T_3') = \frac{(4)(2.7 \cdot 80)}{400} = 2

f(T_5') = f(T_4' \cdot T_6) = \frac{(100)(2 \cdot 2)}{400} = 1

R_q = 2 IF f(T_5') = 0
      1 IF f(T_5') = 1

NOTE: All y's from Fig. 5.43.

Question 8.

Form: T_1 \cdot (T_2 + T_3)

Term Frequencies: f(T_1) = 20
                  f(T_2) = 10
                  f(T_3) = 84

f(T_1') = f(T_2 + T_3) = 10 + 84 - \frac{(2.8)(10 \cdot 84)}{400} = 88

f(T_2') = f(T_1 \cdot T_1') = \frac{(3.4)(20 \cdot 88)}{400} = 15

R_q = 15
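
The arithmetic in these questions follows two patterns: a conjunction is estimated as a coefficient times the product of the two frequencies divided by 400, and a disjunction as the sum of the two frequencies minus that same quantity. A minimal Python sketch of both, reproducing Question 8. The function names are hypothetical, the divisor 400 is taken from the worked examples (presumably the collection size, which is an assumption), and the coefficients are simply the values quoted in the examples:

```python
N_DOCS = 400  # divisor used in the worked examples (assumed to be the collection size)

def estimate_conjunction(f_a, f_b, coeff, n_docs=N_DOCS):
    """Estimated frequency of A*B: coeff * f(A) * f(B) / n_docs."""
    return coeff * f_a * f_b / n_docs

def estimate_disjunction(f_a, f_b, coeff, n_docs=N_DOCS):
    """Estimated frequency of A+B: f(A) + f(B) minus the estimated overlap."""
    return f_a + f_b - estimate_conjunction(f_a, f_b, coeff, n_docs)

# Question 8: T1 * (T2 + T3) with f(T1) = 20, f(T2) = 10, f(T3) = 84.
f_or = round(estimate_disjunction(10, 84, coeff=2.8))    # inner disjunction T2 + T3
r_q = round(estimate_conjunction(20, f_or, coeff=3.4))   # outer conjunction with T1
```

Rounding the intermediate result before the next step mirrors the worked example, where the disjunction is carried forward as a whole number.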

Question 14.

Form: T_1 \cdot T_2

Term Frequency: f(T_1) = 38
                f(T_2) = 31



